adds USDT trace points for supported distros #366

Closed (wants to merge 1 commit)

Conversation

@tgross commented Jan 20, 2019

Python supports user-level statically defined tracing (USDT) on platforms that support it. This has historically not been compiled in by default in Linux distributions because only hosts with SystemTap could take advantage of it. Newer Linux kernels (including all kernels supported by Docker) have eBPF, which lets all users take advantage of tools like pythonflow, pythongc, etc. Including this option would have the happy side effect of making the USDT probes available to DTrace on platforms where Docker containers run on illumos (e.g. Joyent's Triton).
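
To make the benefit concrete, here's a rough sketch of what an eBPF user could do once the probes are compiled in (illustrative only; the interpreter path is a guess and will differ, e.g. the probes end up in libpython when Python is built with --enable-shared):

# List the USDT probes compiled into the interpreter
bpftrace -l 'usdt:/usr/local/bin/python3.7:*'

# Print filename, function name, and line number on every Python function entry
bpftrace -e 'usdt:/usr/local/bin/python3.7:function__entry { printf("%s:%s:%d\n", str(arg0), str(arg1), arg2); }'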

The upstream implementation is a little weird (ref https://docs.python.org/3/howto/instrumentation.html). The --with-dtrace flag expects a dtrace binary to exist so that the appropriate header files can be generated. The systemtap-sdt-dev package on Debian provides a mock "dtrace" binary (actually a Python script) that can generate those header files.
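
In practice the Debian build change is roughly the following (a minimal sketch, not the exact diff in this PR):

# systemtap-sdt-dev provides sys/sdt.h plus a Python-based "dtrace" shim
apt-get update && apt-get install -y --no-install-recommends systemtap-sdt-dev

# configure detects the dtrace shim; make then generates the probe header from Include/pydtrace.d
./configure --with-dtrace
make -j "$(nproc)"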

This package, or one like it, is not available for Alpine. I have another branch with an experiment that vendors the header files and the mock dtrace binary from the Debian systemtap-sdt-dev package. This works inasmuch as the header gets created and I've verified the USDT probes work, but it's a totally gross approach and perhaps dubious from a licensing standpoint, so I didn't include it in this PR. But I'd love to have a discussion about what the preferred approach would be.

@yosifkit (Member) commented:

Just to put some stats down so we have a couple comparisons:

  • on disk size of python:3.7-stretch: 925MB -> 927MB
  • on disk size of python:3.7-slim-stretch: 142MB -> 143MB
  • total build time (master on travis vs PR): 2 hrs 8 min -> 2 hrs 27 min
    • this should be similar to the official build infrastructure

Overall, it might be worth the extra build time and small bump in size; tracing is nice to have. Is there a comparison of Python performance before and after this flag?

On the other branch, I'm not sure vendoring the Debian scripts for Alpine is the right way; that could dramatically increase the maintenance burden in this repo and distract from its simplicity. Maybe we just skip Alpine for now and work to get the package into Alpine proper?

@tgross (Author) commented Jan 22, 2019

Is there a comparison of python performance before and after this flag?

I can put together a benchmark. Will have that in the next day or two.

Maybe we just skip Alpine for now and work to get the package into Alpine proper?

Seems like a reasonable approach.

@yosifkit (Member) commented:

Thanks @tgross!

CC @tianon, I think this is fine if there isn't a significant performance hit; what do you think?

@yosifkit requested a review from tianon, January 22, 2019 22:09
@@ -18,6 +18,7 @@ ENV PYTHONIOENCODING UTF-8
# extra dependencies (over what buildpack-deps already includes)
RUN apt-get update && apt-get install -y --no-install-recommends \
tk-dev \
systemtap-sdt-dev \
Member:

Does this provide any user-facing benefit when we keep it installed?

Author:

No, that's a miss. Will fix.

@tianon (Member) commented Jan 25, 2019

Seems generally sane, although I'm definitely curious to see the results of that benchmark. 👍

This is definitely supported for Python 2.x and 3.x? (https://docs.python.org/2/howto/instrumentation.html is a 404, so I want to make sure Python 2 isn't just quietly ignoring the ./configure flag 😄)

More concretely: https://travis-ci.org/docker-library/python/jobs/481912534#L1050

configure: WARNING: unrecognized options: --with-dtrace

(So I think this should definitely be removed from all the -caveman files too. 👍)
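
A quick way to double-check any given version, assuming you have its source tree handy (a sketch, not something I ran here):

./configure --help | grep -i dtrace                              # 3.6+ advertises --with-dtrace; 2.7 does not
./configure --with-dtrace 2>&1 | grep -i 'unrecognized options'  # any output means configure warned and ignored the flag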

@tgross (Author) commented Feb 2, 2019

Sorry about the delay in replying.

I looked into the Python 2 thing, and it looks like it's not supported upstream on Linux, only via patches for other Unixes. So we can remove that. I'll admit I'm not sure what the -caveman files are, but I've removed it from there as requested. I've rebased this PR on master and pushed up the new commit.

I'm currently running Python's own benchmarking suite against both python:3.7-stretch and the one I've built using this PR. I'm running the suite without attaching any traces to the --with-dtrace container, since we want to make sure the promise of "no runtime overhead when disabled" holds true. I'll also run a partial set of benchmarks with a trace enabled, just because it might be interesting.

Not sure how long this benchmark takes to run but I'm just running it on a machine overnight and I'll post again here tomorrow with results.

@tgross (Author) commented Feb 3, 2019

The Python benchmark suite I ran exercises 60 real-world applications. It takes a bit over 20 minutes to run on my lab machine, so I ran it a few times to verify the results I got here.

Bottom line: although many of the benchmarks don't show a significant difference between --with-dtrace and not, 26 of the 60 tests show a 5%-17% performance hit.

Needless to say, I'm surprised and disappointed by this. Keep in mind this benchmark was run without an active trace, so I would not have expected the no-op function wrapping to create this much overhead. I'm definitely glad I did the benchmarks, because accepting this PR would have meant a significant community-wide performance hit. I'm going to withdraw this PR.

There's another approach worth exploring for eBPF tracing of the interpreter without compiling in USDT, so I may try to tackle publishing a reproducible method for folks to use. But that's not specific to Docker.

Sorry for wasting your time @yosifkit & @tianon! 😊 But at least we have this data now for the next time someone suggests this.

Benchmark results
python-no-dtrace.json
=====================

Performance version: 0.7.0
Report on Linux-4.15.0-30-generic-x86_64-with-debian-9.6
Number of logical CPUs: 4
Start date: 2019-02-03 00:14:21.780332
End date: 2019-02-03 00:37:29.501749

python-dtrace-not-tracing.json
==============================

Performance version: 0.7.0
Report on Linux-4.15.0-30-generic-x86_64-with-debian-9.6
Number of logical CPUs: 4
Start date: 2019-02-03 00:37:32.906533
End date: 2019-02-03 01:01:19.249629

### 2to3 ###
Mean +- std dev: 495 ms +- 4 ms -> 537 ms +- 30 ms: 1.08x slower
Significant (t=-10.61)

### chameleon ###
Mean +- std dev: 15.8 ms +- 0.2 ms -> 17.7 ms +- 1.0 ms: 1.12x slower
Significant (t=-15.00)

### chaos ###
Mean +- std dev: 188 ms +- 1 ms -> 216 ms +- 8 ms: 1.15x slower
Significant (t=-25.94)

### crypto_pyaes ###
Mean +- std dev: 170 ms +- 2 ms -> 198 ms +- 12 ms: 1.17x slower
Significant (t=-18.29)

### deltablue ###
Mean +- std dev: 12.7 ms +- 0.2 ms -> 13.8 ms +- 0.4 ms: 1.09x slower
Significant (t=-20.25)

### django_template ###
Mean +- std dev: 214 ms +- 3 ms -> 251 ms +- 12 ms: 1.17x slower
Significant (t=-22.82)

### dulwich_log ###
Mean +- std dev: 113 ms +- 1 ms -> 116 ms +- 3 ms: 1.03x slower
Significant (t=-10.72)

### fannkuch ###
Mean +- std dev: 767 ms +- 78 ms -> 835 ms +- 40 ms: 1.09x slower
Significant (t=-6.03)

### float ###
Mean +- std dev: 176 ms +- 6 ms -> 185 ms +- 8 ms: 1.05x slower
Significant (t=-6.36)

### genshi_text ###
Mean +- std dev: 50.9 ms +- 1.4 ms -> 53.5 ms +- 2.6 ms: 1.05x slower
Significant (t=-7.03)

### genshi_xml ###
Mean +- std dev: 114 ms +- 6 ms -> 118 ms +- 7 ms: 1.04x slower
Significant (t=-3.46)

### go ###
Mean +- std dev: 467 ms +- 23 ms -> 478 ms +- 27 ms: 1.02x slower
Significant (t=-2.35)

### hexiom ###
Mean +- std dev: 17.8 ms +- 0.5 ms -> 19.5 ms +- 1.3 ms: 1.10x slower
Significant (t=-9.53)

### html5lib ###
Mean +- std dev: 153 ms +- 6 ms -> 160 ms +- 7 ms: 1.05x slower
Significant (t=-6.35)

### json_dumps ###
Mean +- std dev: 20.0 ms +- 1.1 ms -> 20.9 ms +- 1.2 ms: 1.04x slower
Significant (t=-4.12)

### json_loads ###
Mean +- std dev: 41.9 us +- 0.9 us -> 44.9 us +- 1.6 us: 1.07x slower
Significant (t=-12.83)

### logging_format ###
Mean +- std dev: 19.3 us +- 1.1 us -> 20.2 us +- 1.4 us: 1.04x slower
Significant (t=-3.65)

### logging_silent ###
Mean +- std dev: 331 ns +- 19 ns -> 345 ns +- 11 ns: 1.04x slower
Significant (t=-5.20)

### logging_simple ###
Mean +- std dev: 16.8 us +- 0.4 us -> 17.4 us +- 0.6 us: 1.03x slower
Significant (t=-6.19)

### mako ###
Mean +- std dev: 31.0 ms +- 0.7 ms -> 36.2 ms +- 2.2 ms: 1.17x slower
Significant (t=-17.16)

### meteor_contest ###
Mean +- std dev: 151 ms +- 3 ms -> 161 ms +- 8 ms: 1.07x slower
Significant (t=-9.27)

### nbody ###
Mean +- std dev: 199 ms +- 8 ms -> 204 ms +- 13 ms: 1.02x slower
Significant (t=-2.52)

### nqueens ###
Mean +- std dev: 172 ms +- 11 ms -> 174 ms +- 7 ms: 1.01x slower
Not significant

### pathlib ###
Mean +- std dev: 46.5 ms +- 2.2 ms -> 45.9 ms +- 1.8 ms: 1.01x faster
Not significant

### pickle ###
Mean +- std dev: 16.9 us +- 0.7 us -> 16.7 us +- 0.9 us: 1.01x faster
Not significant

### pickle_dict ###
Mean +- std dev: 50.0 us +- 2.2 us -> 51.2 us +- 2.1 us: 1.02x slower
Significant (t=-3.02)

### pickle_list ###
Mean +- std dev: 6.78 us +- 0.34 us -> 6.92 us +- 0.20 us: 1.02x slower
Significant (t=-2.78)

### pickle_pure_python ###
Mean +- std dev: 831 us +- 25 us -> 883 us +- 30 us: 1.06x slower
Significant (t=-10.25)

### pidigits ###
Mean +- std dev: 245 ms +- 14 ms -> 250 ms +- 12 ms: 1.02x slower
Significant (t=-2.10)

### python_startup ###
Mean +- std dev: 11.9 ms +- 0.1 ms -> 11.8 ms +- 0.1 ms: 1.00x faster
Not significant

### python_startup_no_site ###
Mean +- std dev: 8.22 ms +- 0.02 ms -> 8.17 ms +- 0.01 ms: 1.01x faster
Not significant

### raytrace ###
Mean +- std dev: 966 ms +- 91 ms -> 1031 ms +- 57 ms: 1.07x slower
Significant (t=-4.72)

### regex_compile ###
Mean +- std dev: 336 ms +- 20 ms -> 323 ms +- 11 ms: 1.04x faster
Significant (t=4.28)

### regex_dna ###
Mean +- std dev: 219 ms +- 5 ms -> 225 ms +- 13 ms: 1.03x slower
Significant (t=-3.46)

### regex_effbot ###
Mean +- std dev: 4.01 ms +- 0.14 ms -> 4.42 ms +- 0.31 ms: 1.10x slower
Significant (t=-9.38)

### regex_v8 ###
Mean +- std dev: 32.8 ms +- 0.9 ms -> 34.4 ms +- 1.0 ms: 1.05x slower
Significant (t=-9.27)

### richards ###
Mean +- std dev: 145 ms +- 12 ms -> 143 ms +- 5 ms: 1.02x faster
Not significant

### scimark_fft ###
Mean +- std dev: 544 ms +- 27 ms -> 572 ms +- 32 ms: 1.05x slower
Significant (t=-4.95)

### scimark_lu ###
Mean +- std dev: 312 ms +- 19 ms -> 335 ms +- 17 ms: 1.07x slower
Significant (t=-6.94)

### scimark_monte_carlo ###
Mean +- std dev: 180 ms +- 5 ms -> 190 ms +- 5 ms: 1.05x slower
Significant (t=-9.80)

### scimark_sor ###
Mean +- std dev: 363 ms +- 21 ms -> 378 ms +- 22 ms: 1.04x slower
Significant (t=-3.91)

### scimark_sparse_mat_mult ###
Mean +- std dev: 6.59 ms +- 0.33 ms -> 7.35 ms +- 0.51 ms: 1.12x slower
Significant (t=-9.68)

### spectral_norm ###
Mean +- std dev: 225 ms +- 7 ms -> 252 ms +- 9 ms: 1.12x slower
Significant (t=-18.22)

### sqlalchemy_declarative ###
Mean +- std dev: 239 ms +- 8 ms -> 246 ms +- 9 ms: 1.03x slower
Significant (t=-4.55)

### sqlalchemy_imperative ###
Mean +- std dev: 46.4 ms +- 1.7 ms -> 48.7 ms +- 2.5 ms: 1.05x slower
Significant (t=-6.00)

### sqlite_synth ###
Mean +- std dev: 5.27 us +- 0.26 us -> 5.35 us +- 0.30 us: 1.02x slower
Not significant

### sympy_expand ###
Mean +- std dev: 740 ms +- 47 ms -> 743 ms +- 35 ms: 1.00x slower
Not significant

### sympy_integrate ###
Mean +- std dev: 33.7 ms +- 1.5 ms -> 33.8 ms +- 1.3 ms: 1.00x slower
Not significant

### sympy_str ###
Mean +- std dev: 348 ms +- 19 ms -> 352 ms +- 22 ms: 1.01x slower
Not significant

### sympy_sum ###
Mean +- std dev: 167 ms +- 8 ms -> 174 ms +- 9 ms: 1.04x slower
Significant (t=-4.65)

### telco ###
Mean +- std dev: 10.1 ms +- 0.2 ms -> 10.0 ms +- 0.3 ms: 1.01x faster
Not significant

### tornado_http ###
Mean +- std dev: 289 ms +- 14 ms -> 294 ms +- 13 ms: 1.02x slower
Not significant

### unpack_sequence ###
Mean +- std dev: 77.6 ns +- 3.1 ns -> 80.1 ns +- 5.6 ns: 1.03x slower
Significant (t=-3.03)

### unpickle ###
Mean +- std dev: 26.2 us +- 2.4 us -> 25.1 us +- 1.3 us: 1.04x faster
Significant (t=3.15)

### unpickle_list ###
Mean +- std dev: 5.64 us +- 0.19 us -> 5.83 us +- 0.14 us: 1.03x slower
Significant (t=-6.38)

### unpickle_pure_python ###
Mean +- std dev: 664 us +- 36 us -> 701 us +- 30 us: 1.06x slower
Significant (t=-6.23)

### xml_etree_generate ###
Mean +- std dev: 194 ms +- 6 ms -> 193 ms +- 11 ms: 1.01x faster
Not significant

### xml_etree_iterparse ###
Mean +- std dev: 167 ms +- 10 ms -> 171 ms +- 10 ms: 1.02x slower
Significant (t=-2.25)

### xml_etree_parse ###
Mean +- std dev: 208 ms +- 8 ms -> 207 ms +- 10 ms: 1.00x faster
Not significant

### xml_etree_process ###
Mean +- std dev: 151 ms +- 4 ms -> 152 ms +- 4 ms: 1.01x slower
Not significant

Test runner:

#!/bin/bash
set -e

cat <<EOF > Dockerfile
FROM python:3.7-stretch
RUN python3 -m pip install performance
EOF

cat <<EOF > Dockerfile-with-dtrace
FROM 0x74696d/python:3.7-stretch
RUN python3 -m pip install performance
EOF

docker build -t="python-bench-no-dtrace" .
docker build -t="python-bench-with-dtrace" -f Dockerfile-with-dtrace .

echo 'Running benchmark for Python3 baseline'
docker run -d \
       --name bench1 \
       -w /tmp \
       -v "$(pwd)":/tmp \
       python-bench-no-dtrace \
       pyperformance run --python=python3 -o python-no-dtrace.json
time docker wait bench1

echo 'Running benchmark for Python3 with dtrace, no traces enabled'
docker run -d \
       --name bench2 \
       -w /tmp \
       -v "$(pwd)":/tmp \
       python-bench-with-dtrace \
       pyperformance run --python=python3 -o python-dtrace-not-tracing.json
time docker wait bench2

pyperformance compare python-no-dtrace.json python-dtrace-not-tracing.json > results.txt

@tgross closed this Feb 3, 2019
@tianon (Member) commented Feb 4, 2019

Indeed, thanks for your time (and absolute thoroughness) on this, @tgross! ❤️

@alban commented Oct 28, 2019

There's another approach worth exploring for eBPF tracing of the interpreter without compiling-in usdt, so I may try to tackle publishing a reproducible method for that for folks to use.

@tgross what is this other approach?

Thanks!

@tgross (Author) commented Oct 29, 2019

@tgross what is this other approach?

Instrumenting the function entry/exit points that the tracepoints wrap with a uprobe instead. I didn't find this to be feasible, though; it's not portable and is a lot of trouble in any event. If you want instrumentation of the Python runtime without USDT, you might want to consider a stochastic profiler such as pyflame. Otherwise, take the performance hit for your image, at least while profiling.
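
To sketch what I mean (illustrative commands only; the eval-loop symbol and library path depend on the Python version and how it was linked, and resolving Python-level function names from a uprobe needs version-specific frame-object offsets, which is what makes it non-portable):

# Count frame evaluations via a uprobe on the CPython 3.x eval loop (no USDT required)
bpftrace -e 'uprobe:/usr/local/lib/libpython3.7m.so.1.0:_PyEval_EvalFrameDefault { @frames = count(); }'

# Or sample the interpreter stochastically with pyflame and render a flame graph
pyflame -s 60 -r 0.01 -p "$PID" -o prof.txt    # PID of the target Python process
flamegraph.pl < prof.txt > prof.svg            # flamegraph.pl is from Brendan Gregg's FlameGraph repo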
