Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Log ancestors of current span when anomaly found #796

Merged
merged 1 commit into from
Sep 10, 2024

Conversation

timmc-edx
Copy link
Member

Datadog believes that a lingering unfinished celery span from a previous request might be becoming the active context for all future Django requests, causing them to become parented to it. To gather evidence, we'll need to learn not just what the trace root span is, but whether there are unexpected intermediary spans as well.

There may be a great many ancestors, especially with the trace concatenation in play, so we'll walk upwards from the current span towards the root until a limit is reached. This should allow us to find if the current Django request's span has a parent, and what it is.

  • New setting DATADOG_DIAGNOSTICS_LOG_SPAN_DEPTH to control walk depth
  • Move anomaly logging to its own method.
  • Log ancestry of N proximal spans, and adjust formatting of logging to be more compact.

Changes relating to testing:

  • Add comment on how to do manual testing -- calling finish() on the root span lets us fake an anomaly.
  • Lift current_span to early in log_diagnostics. This is helpful during manual testing when DD_DJANGO_INSTRUMENT_MIDDLEWARE=false, because in that case, the current span is the root span. But finishing that span means we can no longer call current_root_span to retrieve it! Retrieving it before that point allows this trick to still work with our code.
  • Need to move middleware setup out of setUp method because we need to be able to adjust Django settings before the middleware is constructed. run_middleware now takes an optional middleware param.
  • Add intermediary span in unit tests so that truncation can be more clearly observed, and to help demonstrate that walk and print order is as expected.
  • Rename test local-root span to be in line with other naming. Don't need the "test_" prefix.

Merge checklist:
Check off if complete or not applicable:

  • Version bumped
  • Changelog record added
  • Documentation updated (not only docstrings)
  • Fixup commits are squashed away
  • Unit tests added/updated
  • Manual testing instructions provided
  • Noted any: Concerns, dependencies, migration issues, deadlines, tickets

Datadog believes that a lingering unfinished celery span from a previous
request might be becoming the active context for all future Django
requests, causing them to become parented to it. To gather evidence, we'll
need to learn not just what the trace root span is, but whether there are
unexpected intermediary spans as well.

There may be a great many ancestors, especially with the trace
concatenation in play, so we'll walk upwards from the current span towards
the root until a limit is reached. This should allow us to find if the
current Django request's span has a parent, and what it is.

- New setting `DATADOG_DIAGNOSTICS_LOG_SPAN_DEPTH` to control walk depth
- Move anomaly logging to its own method.
- Log ancestry of N proximal spans, and adjust formatting of logging to be
  more compact.

Changes relating to testing:

- Add comment on how to do manual testing -- calling finish() on the root
  span lets us fake an anomaly.
- Lift `current_span` to early in `log_diagnostics`. This is helpful during
  manual testing when `DD_DJANGO_INSTRUMENT_MIDDLEWARE=false`, because in
  that case, the current span *is* the root span. But finishing that span
  means we can no longer call current_root_span to retrieve it! Retrieving
  it before that point allows this trick to still work with our code.
- Need to move middleware setup out of `setUp` method because we need to be
  able to adjust Django settings before the middleware is constructed.
  `run_middleware` now takes an optional `middleware` param.
- Add intermediary span in unit tests so that truncation can be more
  clearly observed, and to help demonstrate that walk and print order is
  as expected.
- Rename test local-root span to be in line with other naming. Don't need
  the "test_" prefix.
@timmc-edx
Copy link
Member Author

Manual testing: Uncommented local_root_span.finish() line, installed in devstack (with DD enabled), enabled Waffle flag. Made request, observed the below:

2024-09-10 15:05:47,107 WARNING 143708 [edx_arch_experiments.datadog_diagnostics.middleware] [user None] [ip 172.18.0.1] middleware.py:138 - Anomalous Datadog local root span: trace_id=66e0604a00000000ee9cfc269f8d6eff; duration=0.156; worker_age=44.477; span ancestry:
(ancestors truncated)
name='django.middleware' id=178179805985564339 trace_id=136746283040579197058094726904925744895 parent_id=14741004767024480286 service='django' resource='edx_django_utils.cache.middleware.TieredCacheMiddleware.__call__' type=None start=1725980747.101027 end=None duration=None error=0 tags={'component': 'django'} metrics={} links='' events=''
name='django.middleware' id=6537268979827162269 trace_id=136746283040579197058094726904925744895 parent_id=178179805985564339 service='django' resource='edx_rest_framework_extensions.middleware.RequestCustomAttributesMiddleware.__call__' type=None start=1725980747.102546 end=None duration=None error=0 tags={'component': 'django'} metrics={} links='' events=''
name='django.middleware' id=303177917688351545 trace_id=136746283040579197058094726904925744895 parent_id=6537268979827162269 service='django' resource='edx_rest_framework_extensions.auth.jwt.middleware.EnsureJWTAuthSettingsMiddleware.__call__' type=None start=1725980747.1029794 end=None duration=None error=0 tags={'component': 'django'} metrics={} links='' events=''
name='django.middleware' id=3477580412824992983 trace_id=136746283040579197058094726904925744895 parent_id=303177917688351545 service='django' resource='simple_history.middleware.HistoryRequestMiddleware' type=None start=1725980747.1031067 end=None duration=None error=0 tags={'component': 'django'} metrics={} links='' events=''
name='django.middleware' id=4819840310340803460 trace_id=136746283040579197058094726904925744895 parent_id=3477580412824992983 service='django' resource='openedx.core.djangoapps.site_configuration.middleware.SessionCookieDomainOverrideMiddleware.__call__' type=None start=1725980747.1032326 end=None duration=None error=0 tags={'component': 'django'} metrics={} links='' events=''
name='django.middleware' id=8409666469857767230 trace_id=136746283040579197058094726904925744895 parent_id=4819840310340803460 service='django' resource='lms.djangoapps.discussion.django_comment_client.utils.QueryCountDebugMiddleware.__call__' type=None start=1725980747.1033456 end=None duration=None error=0 tags={'component': 'django'} metrics={} links='' events=''
name='django.middleware' id=8794748453866754632 trace_id=136746283040579197058094726904925744895 parent_id=8409666469857767230 service='django' resource='debug_toolbar.middleware.DebugToolbarMiddleware.__call__' type=None start=1725980747.103461 end=None duration=None error=0 tags={'component': 'django'} metrics={} links='' events=''
name='django.middleware' id=5512647152664412862 trace_id=136746283040579197058094726904925744895 parent_id=8794748453866754632 service='django' resource='edx_arch_experiments.datadog_diagnostics.middleware.DatadogDiagnosticMiddleware.__call__' type=None start=1725980747.1040952 end=None duration=None error=0 tags={'component': 'django'} metrics={} links='' events=''
name='django.middleware' id=323592458807465593 trace_id=136746283040579197058094726904925744895 parent_id=5512647152664412862 service='django' resource='common.djangoapps.third_party_auth.middleware.ExceptionMiddleware.__call__' type=None start=1725980747.1042178 end=None duration=None error=0 tags={'component': 'django'} metrics={} links='' events=''
name='django.middleware' id=7770971294394850867 trace_id=136746283040579197058094726904925744895 parent_id=323592458807465593 service='django' resource='edx_arch_experiments.datadog_diagnostics.middleware.DatadogDiagnosticMiddleware.process_view' type=None start=1725980747.105409 end=None duration=None error=0 tags={'component': 'django'} metrics={} links='' events=''

When LMS is started with DD_DJANGO_INSTRUMENT_MIDDLEWARE=false (as in Production) I instead get:

2024-09-10 15:07:06,907 WARNING 144199 [edx_arch_experiments.datadog_diagnostics.middleware] [user None] [ip 172.18.0.1] middleware.py:138 - Anomalous Datadog local root span: trace_id=66e0609a00000000007bc722383300ea; duration=0.123; worker_age=7.436; span ancestry:
name='django.request' id=9614315245991389650 trace_id=136746289378832198182082680040153350378 parent_id=None service='django' resource='__django_request' type='web' start=1725980826.7844098 end=1725980826.9072769 duration=0.12286707 error=0 tags={'DEFAULT_HASHING_ALGORITHM': 'sha256', '_dd.base_service': '', '_dd.p.dm': '-0', '_dd.p.tid': '66e0609a00000000', 'component': 'django', 'django.request.class': 'django.core.handlers.wsgi.WSGIRequest', 'django_version': '4.2.14', 'has_jwt_cookie': 'False', 'http.method': 'GET', 'ip_chain.external.types': 'priv', 'ip_chain.raw': '172.18.0.1', 'ip_chain.safest_client_ip': '172.18.0.1', 'ip_chain.types': 'priv', 'language': 'python', 'python_version': '3.11.9', 'runtime-id': 'fe774bf3856d40758c4f3a63b4745a64', 'span.kind': 'server'} metrics={'_dd.measured': 1, '_dd.top_level': 1, '_dd.tracer_kr': 1.0, '_sampling_priority_v1': 1, 'cookies.header.size': 0, 'ip_chain.count': 1, 'ip_chain.external.count': 1, 'process_id': 144199, 'safe_sessions.session_cookie_count': 0} links='' events=''

When deployed, we're expecting something like the latter, but with additional span lines above the django.request one.

@timmc-edx timmc-edx merged commit 25aed4b into main Sep 10, 2024
6 of 7 checks passed
@timmc-edx timmc-edx deleted the timmc/anomaly-more branch September 10, 2024 16:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants