Enable APM profiling for edxapp #749

Open
robrap opened this issue Jul 29, 2024 · 5 comments

@robrap
Contributor

robrap commented Jul 29, 2024

Ultimately, we want to enable APM profiling for edxapp, when we think it is safe.

Notes:

  • APM Profiling and Data Streaming may be blocked until New Relic is gone, because we think it might interfere with New Relic.
  • Blocked: See Revert "chore: Enable Data streaming and APM profiling edxapp, forum and django service" configuration#30 for the revert from the last time we tried this.
    • This work is blocked by NR being removed from edxapp, because we think that was the problem.
    • When we first tried enabling (both), we got a large latency issue in Production and the auto-scaling group scaled up to max to try to recover.
    • DD external Slack thread about the original issue. The 2U Slack thread could probably be found if it would be helpful, but I doubt it would be, since we were only guessing at the time.
  • We need to communicate once this is available.
  • Axim (Dave O and others) are interested in anything we might learn from this feature for edx-platform performance improvements that might go on the roadmap.
@robrap robrap added this to Arch-BOM Jul 29, 2024
@robrap robrap converted this from a draft issue Jul 29, 2024
@robrap robrap moved this to Backlog in Arch-BOM Jul 29, 2024
@robrap robrap changed the title from "Enable APM profiling for edxapp" to "[Post-NR] Enable APM profiling for edxapp" Aug 21, 2024
@robrap robrap changed the title from "[Post-NR] Enable APM profiling for edxapp" to "Enable APM profiling for edxapp" Aug 21, 2024
@robrap
Contributor Author

robrap commented Oct 16, 2024

We should roll out to Stage, then Edge, then Prod.

@jristau1984 jristau1984 moved this from Backlog to Ready For Development in Arch-BOM Oct 21, 2024
@dianakhuang dianakhuang self-assigned this Oct 21, 2024
@dianakhuang dianakhuang moved this from Ready For Development to In Progress in Arch-BOM Oct 21, 2024
@timmc-edx
Member

DD support ticket for latency issues we encountered during the most recent rollout attempt: https://help.datadoghq.com/hc/requests/1909564

dianakhuang added a commit to edx/configuration that referenced this issue Nov 12, 2024
It seems like the newer version might be more efficient,
so we should switch to using it.

edx/edx-arch-experiments#749
@timmc-edx
Member

timmc-edx commented Nov 25, 2024

I think I've managed to repro slow gunicorn startup on a sandbox instance.

Profiling setup

Added to /edx/app/edxapp/lms.sh and worker.sh (though the latter shouldn't matter for gunicorn):

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true
export DD_PROFILING_TIMELINE_ENABLED=true

And then:

/edx/bin/supervisorctl restart lms

(Can also restart workers with /edx/bin/supervisorctl restart edxapp_worker:lms_high_1 edxapp_worker:lms_high_mem_1 edxapp_worker:lms_default_1 but not needed for gunicorn experiment.)

To get DD profiling data on both sides, I pushed buttons in the instructor dashboard and made calls to https://timmc.sandbox.edx.org/heartbeat?extended. The data shows up under host:timmc (no env tag is set, unfortunately).

Gunicorn repro

In a dev terminal, make short HTTP calls to the LMS 1-2 times per second:

while true; do curl -sS "https://timmc.sandbox.edx.org/heartbeat" -m1; sleep 0.5; done

Wait about 10 seconds before proceeding with data-gathering. The loop can be left running continuously and does not need to be stopped between tests.

For each config:

  • Edit the LMS configuration.
  • In a root sandbox terminal, restart the LMS with /edx/bin/supervisorctl restart lms and wait about 10 seconds.
  • For each iteration:
    • Restart LMS and pay attention to when the lms: started message appears to help orient yourself in the nginx logs.
    • In /edx/var/log/nginx/access.log, record the startup time as indicated in the Evaluation section below.
    • Wait about 30 seconds.
  • Perform about 3 iterations per config to get enough samples (more if the observed variance is high).

nginx output will look something like this:

3.220.104.68 - - [25/Nov/2024:20:10:59 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.021 "-" "curl/8.5.0" "-" - 0aff3657b83c7779b9a48d87ad185c60
3.220.104.68 - - [25/Nov/2024:20:11:00 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.022 "-" "curl/8.5.0" "-" - 7f8c15ddced976c7199af3f53bde94d3
3.220.104.68 - - [25/Nov/2024:20:11:01 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.824 "-" "curl/8.5.0" "-" - 4d985f61d866cc4b470ac3dbd30b5d2d
3.220.104.68 - - [25/Nov/2024:20:11:03 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.879 "-" "curl/8.5.0" "-" - 173a7f0d30a37266976f45c36205fb37
3.220.104.68 - - [25/Nov/2024:20:11:04 +0000] "GET /heartbeat HTTP/1.1" 503 5416 0.506 "-" "curl/8.5.0" "-" - 0475736b183c4d12b8aaf294a94af859
3.220.104.68 - - [25/Nov/2024:20:11:05 +0000] "GET /heartbeat HTTP/1.1" 503 5416 0.028 "-" "curl/8.5.0" "-" - 0e4b1d04ac591ef74044156e14b11535
3.220.104.68 - - [25/Nov/2024:20:11:05 +0000] "GET /heartbeat HTTP/1.1" 503 5416 0.026 "-" "curl/8.5.0" "-" - d57d0b1f8d67ae64a1c46d53415d6d7b
3.220.104.68 - - [25/Nov/2024:20:11:07 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.842 "-" "curl/8.5.0" "-" - 87c9f84bc24ab93d19dd08b2f7ff3bc9
3.220.104.68 - - [25/Nov/2024:20:11:09 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.884 "-" "curl/8.5.0" "-" - 82a15a9f7e6feb753a0d40741236af89
3.220.104.68 - - [25/Nov/2024:20:11:10 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.860 "-" "curl/8.5.0" "-" - f389d7c7fd36ee272b1781f18d0f96da
3.220.104.68 - - [25/Nov/2024:20:11:12 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.842 "-" "curl/8.5.0" "-" - bb414172da1c9114cff47db64b4a9adb
3.220.104.68 - - [25/Nov/2024:20:11:13 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.848 "-" "curl/8.5.0" "-" - e8bc229b5746b14b11386db26be4e2ef
3.220.104.68 - - [25/Nov/2024:20:11:15 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.830 "-" "curl/8.5.0" "-" - 7a9170e8c2e447af974d1a6f21192d76
3.220.104.68 - - [25/Nov/2024:20:11:16 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.378 "-" "curl/8.5.0" "-" - b3fc5ba259efe44223fff8984d15e9df
3.220.104.68 - - [25/Nov/2024:20:11:18 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.829 "-" "curl/8.5.0" "-" - 066e305fe45a453c7ffc9f7dd5c67163
3.220.104.68 - - [25/Nov/2024:20:11:19 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.207 "-" "curl/8.5.0" "-" - 35a8413f354269347e6a3dca170eb5f8
3.220.104.68 - - [25/Nov/2024:20:11:19 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.019 "-" "curl/8.5.0" "-" - cb087a69b565042aa3dc6f648e4f4736

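The initial transitions, 200 -> 499 and then 499 -> 503, occur during LMS shutdown. The 503 -> 499 transition coincides with the lms: started message from supervisor, and 499 -> 200 is when curl starts getting responses again. (499 is nginx's status code for a client that closed the connection before a response was sent, presumably triggered here by curl's -m1 timeout.)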

For comparison, here's /edx/var/log/supervisor/lms-stderr.log around that time period:

[2024-11-25 20:11:06 +0000] [973414] [INFO] Starting gunicorn 23.0.0
[2024-11-25 20:11:06 +0000] [973414] [INFO] Listening at: http://127.0.0.1:8000 (973414)
[2024-11-25 20:11:06 +0000] [973414] [INFO] Using worker: sync
[2024-11-25 20:11:06 +0000] [973422] [INFO] Booting worker with pid: 973422
[2024-11-25 20:11:06 +0000] [973426] [INFO] Booting worker with pid: 973426
[2024-11-25 20:11:16 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:17 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:19 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:19 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:19 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:20 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:21 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:22 +0000] [973422] [INFO] GET /heartbeat

In this sample, it appears that those calls that were recorded as a 499 did eventually get received by the LMS and were all processed in a burst about 10 seconds after workers actually started.

Evaluation

After the 503s end: Find the number of seconds from the first 499 to the first 200. This is the "startup period".
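The evaluation rule above could be automated rather than read off the log by eye. Below is a minimal sketch in Python, assuming the access-log layout shown in the nginx sample earlier (bracketed timestamp, quoted request, then status) and assuming the lines passed in cover a single restart. The names parse_line and startup_period are made up for illustration and are not part of any existing tooling.

```python
import re
from datetime import datetime

# Captures the bracketed timestamp and the 3-digit status that follows the quoted request.
# Assumption: the log format matches the sample lines in this thread.
LINE_RE = re.compile(r'\[([^\]]+)\] "[^"]*" (\d{3}) ')


def parse_line(line):
    """Return (timestamp, status) for one access-log line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    return timestamp, int(match.group(2))


def startup_period(lines):
    """Seconds from the first 499 after the last 503 to the first 200 after that 499.

    Returns None if the expected 503 -> 499 -> 200 sequence is not present.
    """
    entries = [entry for entry in map(parse_line, lines) if entry is not None]
    statuses = [status for _, status in entries]
    if 503 not in statuses:
        return None
    # Index just past the last 503 (assumes the window covers a single restart).
    start = len(statuses) - statuses[::-1].index(503)
    try:
        first_499 = statuses.index(499, start)
        first_200 = statuses.index(200, first_499)
    except ValueError:
        return None
    return (entries[first_200][0] - entries[first_499][0]).total_seconds()


# Four lines taken from the nginx sample above: last 503, first 499, a later 499, first 200.
sample = [
    '3.220.104.68 - - [25/Nov/2024:20:11:05 +0000] "GET /heartbeat HTTP/1.1" 503 5416 0.026 "-" "curl/8.5.0" "-" - d57d0b1f8d67ae64a1c46d53415d6d7b',
    '3.220.104.68 - - [25/Nov/2024:20:11:07 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.842 "-" "curl/8.5.0" "-" - 87c9f84bc24ab93d19dd08b2f7ff3bc9',
    '3.220.104.68 - - [25/Nov/2024:20:11:18 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.829 "-" "curl/8.5.0" "-" - 066e305fe45a453c7ffc9f7dd5c67163',
    '3.220.104.68 - - [25/Nov/2024:20:11:19 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.207 "-" "curl/8.5.0" "-" - 35a8413f354269347e6a3dca170eb5f8',
]
print(startup_period(sample))  # 12.0 (20:11:07 to 20:11:19)
```

Timestamps in the log only have one-second resolution, so the result is accurate to roughly a second, which seems fine given the differences being measured here.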

Profiling off

With profiling off, the startup period lasts 12 seconds.

Profiling on

With the below profiling config, the startup period lasts 20 seconds.

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true
export DD_PROFILING_TIMELINE_ENABLED=true

@timmc-edx
Member

Additional configurations to establish a baseline:


export DD_PROFILING_ENABLED=true

19 seconds (with one 499 a few seconds after the first 200s); 18; 18


export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true

21; 22; 21


Pretty consistent.

I'll keep DD_PROFILING_TIMELINE_ENABLED disabled for now, since it's not needed for the repro and we'll probably only want it when we actually look at the generated profiles.

To experiment with:

  • DD_PROFILING_API_TIMEOUT which defaults to 10 seconds -- try shortening this to 1 second.
  • DD_PROFILING_CAPTURE_PCT defaults to 1.0 -- try 0.1 or 10.
  • These default to true; try turning them off:
    • DD_PROFILING_ENABLE_CODE_PROVENANCE
    • DD_PROFILING_ENDPOINT_COLLECTION_ENABLED
    • DD_PROFILING_STACK_ENABLED
    • DD_PROFILING_MEMORY_ENABLED
    • DD_PROFILING_LOCK_ENABLED
    • DD_PROFILING_LOCK_NAME_INSPECT_DIR (only relevant when lock profiling is enabled)
    • DD_PROFILING_HEAP_ENABLED
  • These are off by default; try turning them on:
    • DD_PROFILING_STACK_V2_ENABLED (only relevant when stack profiling is enabled)
    • DD_PROFILING_EXPORT_LIBDD_ENABLED
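
One way to work through the default-on toggles above is to generate one config per toggle, each turning off exactly one of them on top of a baseline. The following is only a sketch in Python: the variable names come from the list above, but the baseline choice and the helpers export_lines and single_toggle_configs are assumptions for illustration. DD_PROFILING_LOCK_NAME_INSPECT_DIR is left out because it only matters when lock profiling is enabled.

```python
# Assumed baseline: profiling on, nothing else set explicitly.
BASELINE = {"DD_PROFILING_ENABLED": True}

# Toggles listed above as defaulting to true.
DEFAULT_ON_TOGGLES = [
    "DD_PROFILING_ENABLE_CODE_PROVENANCE",
    "DD_PROFILING_ENDPOINT_COLLECTION_ENABLED",
    "DD_PROFILING_STACK_ENABLED",
    "DD_PROFILING_MEMORY_ENABLED",
    "DD_PROFILING_LOCK_ENABLED",
    "DD_PROFILING_HEAP_ENABLED",
]


def export_lines(settings):
    """Render a settings dict as shell export lines; booleans become true/false."""
    lines = []
    for name, value in settings.items():
        rendered = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f"export {name}={rendered}")
    return lines


def single_toggle_configs(baseline, toggles):
    """Yield (toggle, export lines) pairs, each with exactly one toggle turned off."""
    for toggle in toggles:
        yield toggle, export_lines({**baseline, toggle: False})


for toggle, lines in single_toggle_configs(BASELINE, DEFAULT_ON_TOGGLES):
    print(f"# {toggle} off")
    print("\n".join(lines))
    print()
```

Each printed block could be pasted into lms.sh before restarting the LMS and running the iterations described earlier.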

@timmc-edx
Member

timmc-edx commented Nov 25, 2024

With a baseline of export DD_PROFILING_ENABLED=true and export DD_PROFILING_STACK_V2_ENABLED=true (since v2 is what DD wants everyone to switch to soon anyhow)...

  • DD_PROFILING_API_TIMEOUT=1: 19, 20, 21
  • DD_PROFILING_CAPTURE_PCT=0.1: 21, 19, 21

On to the toggles...


Turning every profiling feature off (except for profiling itself) gets to the "good" situation:

export DD_PROFILING_STACK_ENABLED=false
export DD_PROFILING_MEMORY_ENABLED=false
export DD_PROFILING_HEAP_ENABLED=false
export DD_PROFILING_ENABLE_CODE_PROVENANCE=false
export DD_PROFILING_ENDPOINT_COLLECTION_ENABLED=false
export DD_PROFILING_LOCK_ENABLED=false

11, 9, 11


export DD_PROFILING_ENABLE_CODE_PROVENANCE=false
export DD_PROFILING_ENDPOINT_COLLECTION_ENABLED=false
export DD_PROFILING_LOCK_ENABLED=false

19, 19


export DD_PROFILING_STACK_ENABLED=false
export DD_PROFILING_MEMORY_ENABLED=false
export DD_PROFILING_HEAP_ENABLED=false

12, 11, 11


export DD_PROFILING_STACK_ENABLED=false

18, 17, 17


export DD_PROFILING_HEAP_ENABLED=false

16, 16, 16


export DD_PROFILING_MEMORY_ENABLED=false

14, 13, 13


export DD_PROFILING_MEMORY_ENABLED=false
export DD_PROFILING_HEAP_ENABLED=false

13, 15, 15
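
To make the numbers above easier to compare, here is a small Python sketch that averages each config's startup periods and shows the difference from the 12-second profiling-off figure. The sample values are copied from this thread. The labels are shorthand introduced here, and it is an assumption that every toggle config in this comment was applied on top of the DD_PROFILING_ENABLED + DD_PROFILING_STACK_V2_ENABLED baseline.

```python
from statistics import mean

# Single value reported earlier in the thread with profiling off.
PROFILING_OFF_SECONDS = 12

# Startup periods in seconds, as reported in this thread (labels are shorthand).
RESULTS = {
    "baseline (ENABLED + STACK_V2)": [21, 22, 21],
    "API_TIMEOUT=1": [19, 20, 21],
    "CAPTURE_PCT=0.1": [21, 19, 21],
    "all six toggles off": [11, 9, 11],
    "code provenance + endpoint collection + lock off": [19, 19],
    "stack + memory + heap off": [12, 11, 11],
    "stack off": [18, 17, 17],
    "heap off": [16, 16, 16],
    "memory off": [14, 13, 13],
    "memory + heap off": [13, 15, 15],
}


def summarize(results):
    """Return (label, mean_seconds) pairs sorted from fastest to slowest startup."""
    pairs = ((label, mean(samples)) for label, samples in results.items())
    return sorted(pairs, key=lambda pair: pair[1])


for label, avg in summarize(RESULTS):
    delta = avg - PROFILING_OFF_SECONDS
    print(f"{avg:5.1f}s  ({delta:+.1f}s vs. profiling off)  {label}")
```

With only two or three samples per config, these averages are rough; they are meant for ranking the configs, not for precise estimates.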

Projects
Status: In Progress

3 participants