[feat][misc] PIP-320: Add OpenTelemetry scaffolding #22010

Merged
merged 139 commits into from
Feb 9, 2024


@dragosvictor dragosvictor commented Feb 1, 2024

PIP-320

Motivation

PIP-264 laid out the foundation for switching our entire metrics pipeline to OpenTelemetry. PIP-320 describes the first step in this process, adding an SDK wrapper to instantiate an OpenTelemetry client with a couple of sane defaults for Pulsar brokers, proxies and function workers. This PR provides the implementation for the new wrapper.

Modifications

  • Add new artifact pulsar-otel-metrics-provider to encapsulate the new OpenTelemetryService. Our immediate goal is to make the service available to the broker, proxy, and function worker. Putting this in a new artifact allows us to cherry-pick this dependency as needed. Alternatively, this could be moved to pulsar-common, but it would then leak into the clients as well, which are out of scope for now. For the desired out-of-the-box experience, this artifact additionally pulls in the OTLP and Prometheus exporter dependencies.
  • Add the OpenTelemetryService class as a wrapper for the safe instantiation of the OpenTelemetry SDK. It serves the following purposes:
    • Disable OpenTelemetry by default. Users can enable it through environment variables or system properties.
    • Raise the default cardinality limit (the maximum number of distinct attribute combinations per metric) from 2000 to 10000.
    • Add pulsar.cluster as a resource attribute (aka label) to all metrics emitted by this SDK instance. Note that this parameter cannot be null or empty. Many of the existing Proxy tests were not setting this value in their configuration and had to be adapted. This value can also be overridden by environment variables or system properties.
    • With the SDK instantiated, provide access to OpenTelemetry Meter objects. These are used to emit the actual metrics.
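
The defaults and override order described in the list above can be sketched in plain Java. This is a minimal illustration, not the OpenTelemetryService code from this PR: the class and method names are hypothetical, and it assumes the OpenTelemetry autoconfigure conventions that environment variables such as OTEL_SDK_DISABLED map to properties such as otel.sdk.disabled, and that system properties take precedence over environment variables. The cardinality property name is the experimental one used by the autoconfigure extension and may differ between SDK versions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the wrapper's defaults; not the PR's implementation. */
public class OtelDefaultsSketch {

    // Property names follow the OpenTelemetry autoconfigure extension (assumed here).
    static final String SDK_DISABLED = "otel.sdk.disabled";
    static final String CARDINALITY_LIMIT = "otel.experimental.metrics.cardinality.limit";

    /** Defaults described in the PR: SDK disabled, cardinality limit raised to 10000. */
    static Map<String, String> pulsarDefaults() {
        Map<String, String> defaults = new HashMap<>();
        defaults.put(SDK_DISABLED, "true");
        defaults.put(CARDINALITY_LIMIT, "10000");
        return defaults;
    }

    /** Converts an environment variable name to its property form, e.g. OTEL_SDK_DISABLED. */
    static String envToProperty(String envName) {
        return envName.toLowerCase(Locale.ROOT).replace('_', '.');
    }

    /** Applies overrides in increasing priority: defaults, environment, system properties. */
    static Map<String, String> resolve(Map<String, String> env, Map<String, String> sysProps) {
        Map<String, String> effective = pulsarDefaults();
        env.forEach((name, value) -> {
            if (name.startsWith("OTEL_")) {
                effective.put(envToProperty(name), value);
            }
        });
        sysProps.forEach((name, value) -> {
            if (name.startsWith("otel.")) {
                effective.put(name, value);
            }
        });
        return effective;
    }

    /** Rejects a null or blank cluster name, mirroring the constraint on pulsar.cluster. */
    static String requireClusterName(String clusterName) {
        if (clusterName == null || clusterName.trim().isEmpty()) {
            throw new IllegalArgumentException("Cluster name cannot be null or empty");
        }
        return clusterName;
    }
}
```

Under these assumptions, starting a broker with OTEL_SDK_DISABLED=false in its environment would switch the SDK on, while leaving both sources empty keeps it off.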

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • Added unit test for class OpenTelemetryService, including:
    • testClusterNameCannotBeEmpty
    • testClusterNameCannotBeNull
    • testResourceAttributesAreSet
    • testIsInstrumentationNameSetOnMeter
    • testMetricCardinality
    • testLongCounter: verifies basic integer counter metrics can be emitted
    • testServiceIsDisabledByDefault
  • Added integration test class MetricsTest, validating the entire end-to-end metrics pipeline for brokers, proxies, and function workers, using both the in-process Prometheus exporter and a remote OTLP collector. It runs in its own CI integration test target, as it did not fit naturally into any existing one.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
    • OpenTelemetry SDK libraries
    • OpenTelemetry SDK Autoconfigure Extension
      • This references the OpenTelemetry logs and traces libraries, too, which we don't need. Removing them does not work, though, as the Autoconfigure Extension does not load properly without them.
    • OpenTelemetry OTLP exporter
    • OpenTelemetry Prometheus exporter
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
    • This PR adds support for further migration of metrics to OpenTelemetry. No metrics are being changed as part of this PR itself.
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: dragosvictor#6


@asafm asafm left a comment


Looks brilliant now, thanks for fixing all the comments!


@lhotari lhotari left a comment


LGTM. Good work @dragosvictor

codecov-commenter commented Feb 9, 2024

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (3036783) 36.56% compared to head (e7696fc) 73.64%.
Report is 1 commits behind head on master.

Additional details and impacted files

@@              Coverage Diff              @@
##             master   #22010       +/-   ##
=============================================
+ Coverage     36.56%   73.64%   +37.07%     
- Complexity    12418    32043    +19625     
=============================================
  Files          1729     1870      +141     
  Lines        132076   139039     +6963     
  Branches      14452    15245      +793     
=============================================
+ Hits          48295   102394    +54099     
+ Misses        77387    28712    -48675     
- Partials       6394     7933     +1539     
Flag Coverage Δ
inttests 24.65% <64.78%> (+0.51%) ⬆️
systests 24.36% <64.78%> (+0.39%) ⬆️
unittests 72.91% <87.32%> (+40.94%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
.../pulsar/opentelemetry/OpenTelemetryAttributes.java 100.00% <100.00%> (ø)
...n/java/org/apache/pulsar/broker/PulsarService.java 82.15% <75.00%> (+13.30%) ⬆️
...pulsar/broker/stats/PulsarBrokerOpenTelemetry.java 90.90% <90.90%> (ø)
...ar/functions/worker/PulsarWorkerOpenTelemetry.java 90.90% <90.90%> (ø)
...a/org/apache/pulsar/proxy/server/ProxyService.java 79.16% <75.00%> (+29.16%) ⬆️
...e/pulsar/proxy/stats/PulsarProxyOpenTelemetry.java 90.90% <90.90%> (ø)
...e/pulsar/functions/worker/PulsarWorkerService.java 69.91% <50.00%> (+10.50%) ⬆️
...che/pulsar/opentelemetry/OpenTelemetryService.java 92.00% <92.00%> (ø)

... and 1439 files with indirect coverage changes

Labels: doc-required, ready-to-test
5 participants