Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performance on Span.set_status #385

Closed
calvin-kargo opened this issue Apr 27, 2022 · 1 comment
Closed

Slow performance on Span.set_status #385

calvin-kargo opened this issue Apr 27, 2022 · 1 comment

Comments

@calvin-kargo
Copy link
Contributor

Hi, i want to raise an issue about opentelemetry erlang performance on Span.set_status, but i myself was unable to reproduce it on local environment.
Recently one of production service on my company experiencing high timeout/slow API latency when it's error, which is strange because error is usually handled faster (via let it crash) then success response.
Upon further investigation and debugging on production instance (it doesn't happen on local), turn out that one of otel instrumentation library was causing this really high slow down, from ~5ms response to ~1s.
After even more investigation, it was occuring due to OpenTelemetry.Span.set_status method, and no other! here's the telemetry code:

  def handle_operation_stop(_event_name, _measurements, data, config) do
    errors = data.blueprint.result[:errors]

    result_attributes =
      []
      |> put_if(
        config.trace_response_result,
        {"graphql.response.result", Jason.encode!(data.blueprint.result)}
      )
      |> put_if(
        config.trace_response_errors,
        {"graphql.response.errors", Jason.encode!(errors)}
      )

    ctx = OpentelemetryTelemetry.set_current_telemetry_span(@tracer_id, data)

    case errors do
      nil -> :ok
      [] -> :ok
      _ -> Span.set_status(ctx, OpenTelemetry.status(:error, ""))
    end

    Span.set_attributes(ctx, result_attributes)

    OpentelemetryTelemetry.end_telemetry_span(@tracer_id, data)
  end

Only after commenting out Span.set_status could i fix the performance issue. I double check it multiple time and try commenting other part of code too! just to ensure it's really Span.set_status which are causing the issue and no other.

I tried to reproduce this locally, but i was unable to reproduce this anomaly

here's my application spec:
Elixir 1.13.3, erlang/OTP 24

Opentelemetry dependency:
opentelemetry, "1.0.2"
opentelemetry_api, "1.0.2"
opentelemetry_telemetry, "1.0.0-beta.7"

additional observation i could gather that it only happen on production instance (which are handling a lot of request), in the instance when this happen most of the time the memory usage is high and dominated by ets

trying to debug issue from source code, i could only spot that on https://github.com/open-telemetry/opentelemetry-erlang/blob/main/apps/opentelemetry/src/otel_span_ets.erl#L140 , set_status is using ets:select_replace, and set_attributes is using ets:update_element. from what i could gather on https://www.erlang.org/doc/man/ets.html#select_replace-2 select replace could traverse the whole ets table, which might result in O(n) complexity compared to direct update which might be O(1), but i am unfamiliar with ets function used so i am not sure what i understand is correct.

the application used default config of opentelemetry (generally no set config), with custom http endpoint (edited)

@tsloughter
Copy link
Member

Thanks, this is resolved in #384

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants