adds USDT trace points for supported distros #366

Closed (wants to merge 1 commit)

Conversation

@tgross commented Jan 20, 2019

Python supports user-level statically defined tracing (USDT) on platforms that support it. This has historically not been compiled in by default in Linux distributions because only hosts with SystemTap could take advantage of it. Newer Linux kernels (including all kernels supported by Docker) have eBPF, which lets all users take advantage of tools like pythonflow, pythongc, etc. Including this option would have the happy side effect of making the USDT probes available to DTrace on platforms where Docker containers run on illumos (e.g. Joyent's Triton).
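
To make the benefit concrete, here's a rough sketch of what an eBPF user could do once the probes are compiled in (illustrative only; the interpreter path is a guess and will differ, e.g. the probes end up in libpython when Python is built with --enable-shared):

# List the USDT probes compiled into the interpreter
bpftrace -l 'usdt:/usr/local/bin/python3.7:*'

# Print filename, function name, and line number on every Python function entry
bpftrace -e 'usdt:/usr/local/bin/python3.7:function__entry { printf("%s:%s:%d\n", str(arg0), str(arg1), arg2); }'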

The upstream implementation is a little weird (ref https://docs.python.org/3/howto/instrumentation.html). The --with-dtrace flag expects a dtrace binary to exist so that the appropriate header files can be generated. The systemtap-sdt-dev package on Debian provides a mock "dtrace" binary (actually a Python script) that can generate those header files.
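
In practice the Debian build change is roughly the following (a minimal sketch, not the exact diff in this PR):

# systemtap-sdt-dev provides sys/sdt.h plus a Python-based "dtrace" shim
apt-get update && apt-get install -y --no-install-recommends systemtap-sdt-dev

# configure detects the dtrace shim; make then generates the probe header from Include/pydtrace.d
./configure --with-dtrace
make -j "$(nproc)"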

This package, or one like it, is not available for Alpine. I have another branch with an experiment that vendors the header files and the mock dtrace binary from the Debian systemtap-sdt-dev package. This works inasmuch as the header gets created and I've verified the USDT probes work, but it's a totally gross approach and perhaps dubious from a licensing standpoint, so I didn't include it in this PR. But I'd love to have a discussion about what the preferred approach would be.

@yosifkit (Member) commented:

Just to put some stats down so we have a couple comparisons:

  • on disk size of python:3.7-stretch: 925MB -> 927MB
  • on disk size of python:3.7-slim-stretch: 142MB -> 143MB
  • total build time (master on travis vs PR): 2 hrs 8 min -> 2 hrs 27 min
    • this should be similar to the official build infrastructure

Overall, it might be worth the extra build time and small bump in size; tracing is nice to have. Is there a comparison of Python performance before and after this flag?

On the other branch, I'm not sure vendoring the Debian scripts for Alpine is the right way; that could dramatically increase the maintenance burden in this repo and distract from its simplicity. Maybe we just skip Alpine for now and work to get the package into Alpine proper?

@tgross (Author) commented Jan 22, 2019

Is there a comparison of python performance before and after this flag?

I can put together a benchmark. Will have that in the next day or two.

Maybe we just skip Alpine for now and work to get the package into Alpine proper?

Seems like a reasonable approach.

@yosifkit (Member) commented:

Thanks @tgross!

CC @tianon, I think this is fine if there isn't a significant performance hit; what do you think?

@yosifkit requested a review from tianon, January 22, 2019 22:09
@@ -18,6 +18,7 @@ ENV PYTHONIOENCODING UTF-8
# extra dependencies (over what buildpack-deps already includes)
RUN apt-get update && apt-get install -y --no-install-recommends \
tk-dev \
systemtap-sdt-dev \
Member:

Does this provide any user-facing benefit when we keep it installed?

Author:

No, that's a miss. Will fix.

@tianon (Member) commented Jan 25, 2019

Seems generally sane, although I'm definitely curious to see the results of that benchmark. 👍

This is definitely supported for Python 2.x and 3.x? (https://docs.python.org/2/howto/instrumentation.html is a 404, so I want to make sure Python 2 isn't just quietly ignoring the ./configure flag 😄)

More concretely: https://travis-ci.org/docker-library/python/jobs/481912534#L1050

configure: WARNING: unrecognized options: --with-dtrace

(So I think this should definitely be removed from all the -caveman files too. 👍)
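
A quick way to double-check any given version, assuming you have its source tree handy (a sketch, not something I ran here):

./configure --help | grep -i dtrace                              # 3.6+ advertises --with-dtrace; 2.7 does not
./configure --with-dtrace 2>&1 | grep -i 'unrecognized options'  # any output means configure warned and ignored the flag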

@tgross (Author) commented Feb 2, 2019

Sorry about the delay in replying.

I looked into the Python 2 thing, and it looks like it's not supported upstream on Linux, only via patches for other Unixes. So we can remove that. I'll admit I'm not sure what the -caveman files are, but I've removed it from there as requested. I've rebased this PR on master and pushed up the new commit.

I'm currently running Python's own benchmarking suite against both python:3.7-stretch and the one I've built using this PR. I'm running the suite without attaching any traces to the --with-dtrace container, since we want to make sure the promise of "no runtime overhead when disabled" holds true. I'll also run a partial set of benchmarks with a trace enabled, just because it might be interesting.

Not sure how long this benchmark takes to run but I'm just running it on a machine overnight and I'll post again here tomorrow with results.

@tgross (Author) commented Feb 3, 2019

The Python benchmark suite I ran exercises 60 real-world applications. It takes a bit over 20 minutes to run on my lab machine, so I ran it a few times to verify the results I got here.

Bottom line: although many of the benchmarks don't show a significant difference between --with-dtrace and not, 26 of the 60 tests show a 5%-17% performance hit.

Needless to say, I'm surprised and disappointed by this. Keep in mind this benchmark was run without an active trace, so I would not have expected the no-op function wrapping to create this much overhead. I'm definitely glad I did the benchmarks, because accepting this PR would have meant a significant community-wide performance hit. I'm going to withdraw this PR.

There's another approach worth exploring for eBPF tracing of the interpreter without compiling in USDT, so I may try to tackle publishing a reproducible method for folks to use. But that's not specific to Docker.

Sorry for wasting your time @yosifkit & @tianon! 😊 But at least we have this data now for the next time someone suggests this.

Benchmark results
python-no-dtrace.json
=====================

Performance version: 0.7.0
Report on Linux-4.15.0-30-generic-x86_64-with-debian-9.6
Number of logical CPUs: 4
Start date: 2019-02-03 00:14:21.780332
End date: 2019-02-03 00:37:29.501749

python-dtrace-not-tracing.json
==============================

Performance version: 0.7.0
Report on Linux-4.15.0-30-generic-x86_64-with-debian-9.6
Number of logical CPUs: 4
Start date: 2019-02-03 00:37:32.906533
End date: 2019-02-03 01:01:19.249629

### 2to3 ###
Mean +- std dev: 495 ms +- 4 ms -> 537 ms +- 30 ms: 1.08x slower
Significant (t=-10.61)

### chameleon ###
Mean +- std dev: 15.8 ms +- 0.2 ms -> 17.7 ms +- 1.0 ms: 1.12x slower
Significant (t=-15.00)

### chaos ###
Mean +- std dev: 188 ms +- 1 ms -> 216 ms +- 8 ms: 1.15x slower
Significant (t=-25.94)

### crypto_pyaes ###
Mean +- std dev: 170 ms +- 2 ms -> 198 ms +- 12 ms: 1.17x slower
Significant (t=-18.29)

### deltablue ###
Mean +- std dev: 12.7 ms +- 0.2 ms -> 13.8 ms +- 0.4 ms: 1.09x slower
Significant (t=-20.25)

### django_template ###
Mean +- std dev: 214 ms +- 3 ms -> 251 ms +- 12 ms: 1.17x slower
Significant (t=-22.82)

### dulwich_log ###
Mean +- std dev: 113 ms +- 1 ms -> 116 ms +- 3 ms: 1.03x slower
Significant (t=-10.72)

### fannkuch ###
Mean +- std dev: 767 ms +- 78 ms -> 835 ms +- 40 ms: 1.09x slower
Significant (t=-6.03)

### float ###
Mean +- std dev: 176 ms +- 6 ms -> 185 ms +- 8 ms: 1.05x slower
Significant (t=-6.36)

### genshi_text ###
Mean +- std dev: 50.9 ms +- 1.4 ms -> 53.5 ms +- 2.6 ms: 1.05x slower
Significant (t=-7.03)

### genshi_xml ###
Mean +- std dev: 114 ms +- 6 ms -> 118 ms +- 7 ms: 1.04x slower
Significant (t=-3.46)

### go ###
Mean +- std dev: 467 ms +- 23 ms -> 478 ms +- 27 ms: 1.02x slower
Significant (t=-2.35)

### hexiom ###
Mean +- std dev: 17.8 ms +- 0.5 ms -> 19.5 ms +- 1.3 ms: 1.10x slower
Significant (t=-9.53)

### html5lib ###
Mean +- std dev: 153 ms +- 6 ms -> 160 ms +- 7 ms: 1.05x slower
Significant (t=-6.35)

### json_dumps ###
Mean +- std dev: 20.0 ms +- 1.1 ms -> 20.9 ms +- 1.2 ms: 1.04x slower
Significant (t=-4.12)

### json_loads ###
Mean +- std dev: 41.9 us +- 0.9 us -> 44.9 us +- 1.6 us: 1.07x slower
Significant (t=-12.83)

### logging_format ###
Mean +- std dev: 19.3 us +- 1.1 us -> 20.2 us +- 1.4 us: 1.04x slower
Significant (t=-3.65)

### logging_silent ###
Mean +- std dev: 331 ns +- 19 ns -> 345 ns +- 11 ns: 1.04x slower
Significant (t=-5.20)

### logging_simple ###
Mean +- std dev: 16.8 us +- 0.4 us -> 17.4 us +- 0.6 us: 1.03x slower
Significant (t=-6.19)

### mako ###
Mean +- std dev: 31.0 ms +- 0.7 ms -> 36.2 ms +- 2.2 ms: 1.17x slower
Significant (t=-17.16)

### meteor_contest ###
Mean +- std dev: 151 ms +- 3 ms -> 161 ms +- 8 ms: 1.07x slower
Significant (t=-9.27)

### nbody ###
Mean +- std dev: 199 ms +- 8 ms -> 204 ms +- 13 ms: 1.02x slower
Significant (t=-2.52)

### nqueens ###
Mean +- std dev: 172 ms +- 11 ms -> 174 ms +- 7 ms: 1.01x slower
Not significant

### pathlib ###
Mean +- std dev: 46.5 ms +- 2.2 ms -> 45.9 ms +- 1.8 ms: 1.01x faster
Not significant

### pickle ###
Mean +- std dev: 16.9 us +- 0.7 us -> 16.7 us +- 0.9 us: 1.01x faster
Not significant

### pickle_dict ###
Mean +- std dev: 50.0 us +- 2.2 us -> 51.2 us +- 2.1 us: 1.02x slower
Significant (t=-3.02)

### pickle_list ###
Mean +- std dev: 6.78 us +- 0.34 us -> 6.92 us +- 0.20 us: 1.02x slower
Significant (t=-2.78)

### pickle_pure_python ###
Mean +- std dev: 831 us +- 25 us -> 883 us +- 30 us: 1.06x slower
Significant (t=-10.25)

### pidigits ###
Mean +- std dev: 245 ms +- 14 ms -> 250 ms +- 12 ms: 1.02x slower
Significant (t=-2.10)

### python_startup ###
Mean +- std dev: 11.9 ms +- 0.1 ms -> 11.8 ms +- 0.1 ms: 1.00x faster
Not significant

### python_startup_no_site ###
Mean +- std dev: 8.22 ms +- 0.02 ms -> 8.17 ms +- 0.01 ms: 1.01x faster
Not significant

### raytrace ###
Mean +- std dev: 966 ms +- 91 ms -> 1031 ms +- 57 ms: 1.07x slower
Significant (t=-4.72)

### regex_compile ###
Mean +- std dev: 336 ms +- 20 ms -> 323 ms +- 11 ms: 1.04x faster
Significant (t=4.28)

### regex_dna ###
Mean +- std dev: 219 ms +- 5 ms -> 225 ms +- 13 ms: 1.03x slower
Significant (t=-3.46)

### regex_effbot ###
Mean +- std dev: 4.01 ms +- 0.14 ms -> 4.42 ms +- 0.31 ms: 1.10x slower
Significant (t=-9.38)

### regex_v8 ###
Mean +- std dev: 32.8 ms +- 0.9 ms -> 34.4 ms +- 1.0 ms: 1.05x slower
Significant (t=-9.27)

### richards ###
Mean +- std dev: 145 ms +- 12 ms -> 143 ms +- 5 ms: 1.02x faster
Not significant

### scimark_fft ###
Mean +- std dev: 544 ms +- 27 ms -> 572 ms +- 32 ms: 1.05x slower
Significant (t=-4.95)

### scimark_lu ###
Mean +- std dev: 312 ms +- 19 ms -> 335 ms +- 17 ms: 1.07x slower
Significant (t=-6.94)

### scimark_monte_carlo ###
Mean +- std dev: 180 ms +- 5 ms -> 190 ms +- 5 ms: 1.05x slower
Significant (t=-9.80)

### scimark_sor ###
Mean +- std dev: 363 ms +- 21 ms -> 378 ms +- 22 ms: 1.04x slower
Significant (t=-3.91)

### scimark_sparse_mat_mult ###
Mean +- std dev: 6.59 ms +- 0.33 ms -> 7.35 ms +- 0.51 ms: 1.12x slower
Significant (t=-9.68)

### spectral_norm ###
Mean +- std dev: 225 ms +- 7 ms -> 252 ms +- 9 ms: 1.12x slower
Significant (t=-18.22)

### sqlalchemy_declarative ###
Mean +- std dev: 239 ms +- 8 ms -> 246 ms +- 9 ms: 1.03x slower
Significant (t=-4.55)

### sqlalchemy_imperative ###
Mean +- std dev: 46.4 ms +- 1.7 ms -> 48.7 ms +- 2.5 ms: 1.05x slower
Significant (t=-6.00)

### sqlite_synth ###
Mean +- std dev: 5.27 us +- 0.26 us -> 5.35 us +- 0.30 us: 1.02x slower
Not significant

### sympy_expand ###
Mean +- std dev: 740 ms +- 47 ms -> 743 ms +- 35 ms: 1.00x slower
Not significant

### sympy_integrate ###
Mean +- std dev: 33.7 ms +- 1.5 ms -> 33.8 ms +- 1.3 ms: 1.00x slower
Not significant

### sympy_str ###
Mean +- std dev: 348 ms +- 19 ms -> 352 ms +- 22 ms: 1.01x slower
Not significant

### sympy_sum ###
Mean +- std dev: 167 ms +- 8 ms -> 174 ms +- 9 ms: 1.04x slower
Significant (t=-4.65)

### telco ###
Mean +- std dev: 10.1 ms +- 0.2 ms -> 10.0 ms +- 0.3 ms: 1.01x faster
Not significant

### tornado_http ###
Mean +- std dev: 289 ms +- 14 ms -> 294 ms +- 13 ms: 1.02x slower
Not significant

### unpack_sequence ###
Mean +- std dev: 77.6 ns +- 3.1 ns -> 80.1 ns +- 5.6 ns: 1.03x slower
Significant (t=-3.03)

### unpickle ###
Mean +- std dev: 26.2 us +- 2.4 us -> 25.1 us +- 1.3 us: 1.04x faster
Significant (t=3.15)

### unpickle_list ###
Mean +- std dev: 5.64 us +- 0.19 us -> 5.83 us +- 0.14 us: 1.03x slower
Significant (t=-6.38)

### unpickle_pure_python ###
Mean +- std dev: 664 us +- 36 us -> 701 us +- 30 us: 1.06x slower
Significant (t=-6.23)

### xml_etree_generate ###
Mean +- std dev: 194 ms +- 6 ms -> 193 ms +- 11 ms: 1.01x faster
Not significant

### xml_etree_iterparse ###
Mean +- std dev: 167 ms +- 10 ms -> 171 ms +- 10 ms: 1.02x slower
Significant (t=-2.25)

### xml_etree_parse ###
Mean +- std dev: 208 ms +- 8 ms -> 207 ms +- 10 ms: 1.00x faster
Not significant

### xml_etree_process ###
Mean +- std dev: 151 ms +- 4 ms -> 152 ms +- 4 ms: 1.01x slower
Not significant

Test runner:

#!/bin/bash
set -e

cat <<EOF > Dockerfile
FROM python:3.7-stretch
RUN python3 -m pip install performance
EOF

cat <<EOF > Dockerfile-with-dtrace
FROM 0x74696d/python:3.7-stretch
RUN python3 -m pip install performance
EOF

docker build -t="python-bench-no-dtrace" .
docker build -t="python-bench-with-dtrace" -f Dockerfile-with-dtrace .

echo 'Running benchmark for Python3 baseline'
docker run -d \
       --name bench1 \
       -w /tmp \
       -v "$(pwd)":/tmp \
       python-bench-no-dtrace \
       pyperformance run --python=python3 -o python-no-dtrace.json
time docker wait bench1

echo 'Running benchmark for Python3 with dtrace, no traces enabled'
docker run -d \
       --name bench2 \
       -w /tmp \
       -v "$(pwd)":/tmp \
       python-bench-with-dtrace \
       pyperformance run --python=python3 -o python-dtrace-not-tracing.json
time docker wait bench2

pyperformance compare python-no-dtrace.json python-dtrace-not-tracing.json > results.txt

@tgross closed this Feb 3, 2019
@tianon (Member) commented Feb 4, 2019

Indeed, thanks for your time (and absolute thoroughness) on this, @tgross! ❤️

@alban commented Oct 28, 2019

There's another approach worth exploring for eBPF tracing of the interpreter without compiling-in usdt, so I may try to tackle publishing a reproducible method for that for folks to use.

@tgross what is this other approach?

Thanks!

@tgross (Author) commented Oct 29, 2019

@tgross what is this other approach?

Instrumenting the function entry/exit points that the tracepoints wrap with a uprobe instead. I didn't find this to be feasible, though; it's not portable and is a lot of trouble in any event. If you want instrumentation of the Python runtime without USDT, you might want to consider a stochastic profiler such as pyflame. Otherwise, take the performance hit for your image, at least while profiling.
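
To sketch what I mean (illustrative commands only; the eval-loop symbol and library path depend on the Python version and how it was linked, and resolving Python-level function names from a uprobe needs version-specific frame-object offsets, which is what makes it non-portable):

# Count frame evaluations via a uprobe on the CPython 3.x eval loop (no USDT required)
bpftrace -e 'uprobe:/usr/local/lib/libpython3.7m.so.1.0:_PyEval_EvalFrameDefault { @frames = count(); }'

# Or sample the interpreter stochastically with pyflame and render a flame graph
pyflame -s 60 -r 0.01 -p "$PID" -o prof.txt    # PID of the target Python process
flamegraph.pl < prof.txt > prof.svg            # flamegraph.pl is from Brendan Gregg's FlameGraph repo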
