Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve OpenTelemetry Collector JSON Marshalling #3286

Closed
wants to merge 23 commits into from

Conversation

iblancasa
Copy link
Contributor

Fixes #3281

@iblancasa iblancasa requested a review from a team September 12, 2024 16:12
Signed-off-by: Israel Blancas <[email protected]>
@iblancasa iblancasa requested a review from a team as a code owner September 19, 2024 13:10
Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what is actually going on in this PR. From the tests, it seems like everything works correctly, but I don't know why. I've left some comments asking questions, but in general I think you should exhaustively explain the weird marshaling functions in docstrings.

apis/v1beta1/config.go Outdated Show resolved Hide resolved
apis/v1beta1/config.go Outdated Show resolved Hide resolved
@iblancasa
Copy link
Contributor Author

I don't understand what is actually going on in this PR. From the tests, it seems like everything works correctly, but I don't know why.

I tried to explain it in the changelog:

Previously, the v1beta1 OpenTelemetryCollector CR's config field was incorrectly
marshalled, potentially causing issues when storing in etcd or retrieving via kubectl.
This fix ensures the config is properly marshaled, maintaining the correct
structure and data types. For example, a config with nested maps and arrays will now
be correctly preserved when stored and retrieved, instead of being flattened or
losing structure.

The problem happens when you want to create one OpenTelemetry Collector instance like this one:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
    config:
        receivers:
            otlp:
                http:
                grpc:

Then, you create it in the cluster with, for instance, kubectl.

kubectl apply -f otelcol.yaml

When you get the OpenTelemetry Collector instance, you will get this:

$ kubectl get otelcol -o yaml otel
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
    config:
        receivers:
            otlp:

We have a warning mentioning this can happen, but it can be difficult for users to see.

I'll add more info to the changelog.

I've left some comments asking questions, but in general I think you should exhaustively explain the weird marshaling functions in docstrings.

I just added some comments to the source code.

@iblancasa iblancasa requested a review from swiatekm September 30, 2024 09:49
Copy link
Contributor

@jaronoff97 jaronoff97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not entirely sure how or why this works... maybe we can talk about it in slack?

// This allows you to embed the struct safely within itself without triggering that infinite loop.
type Alias Config
// Ensure we call the custom marshaller for AnyConfig for the Receivers
receiversJSON, err := json.Marshal(c.Receivers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we only need to do this for receivers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iblancasa bump on this

apis/v1beta1/config.go Outdated Show resolved Hide resolved
@iblancasa
Copy link
Contributor Author

I'm still not entirely sure how or why this works... maybe we can talk about it in slack?

Just messaged you.

@iblancasa
Copy link
Contributor Author

The E2E tests are failing because there was a problem while downloading a dependency or something.

@iblancasa
Copy link
Contributor Author

@jaronoff97 @swiatekm do you have any other questions or concerns regarding this PR?

.chloggen/3281-marshal-config.yaml Show resolved Hide resolved
// This allows you to embed the struct safely within itself without triggering that infinite loop.
type Alias Config
// Ensure we call the custom marshaller for AnyConfig for the Receivers
receiversJSON, err := json.Marshal(c.Receivers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iblancasa bump on this

Comment on lines +44 to +62
testScheme := scheme.Scheme
err := AddToScheme(testScheme)
require.NoError(t, err)

testEnv := &envtest.Environment{
CRDDirectoryPaths: []string{filepath.Join("..", "..", "config", "crd", "bases")},
}

cfg, err := testEnv.Start()
require.NoError(t, err)
defer func() {
errStop := testEnv.Stop()
require.NoError(t, errStop)
}()

k8sClient, err := client.New(cfg, client.Options{Scheme: testScheme})
if err != nil {
fmt.Printf("failed to setup a Kubernetes client: %v", err)
os.Exit(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ive found this fake client package a lot more usable

name: stateful-collector-85dbe673
name: stateful-collector-57be076a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOC why are the hashes changing? id be interested to understand how the full content is being changed.

@iblancasa
Copy link
Contributor Author

Closing this PR because the changes introduced in #3333 workaround the problem.

Since it seems it was a problem for so many people to follow what's going on this PR, I'll add an explanation here just as a reference.

The issue

After creating an OpenTelemetryCollector instance from a YAML file like this one:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
    config:
        receivers:
            otlp:
                protocols:
                    http:
                    grpc:

If you get the OpenTelemetryCollector instance from Kubernetes:

kubectl get otelcol -o yaml otel

this is what you get:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
    config:
        receivers:
            otlp:
                protocols:

The values for protocols are missing. This does not just apply to otlp receiver but to all those configurations with a key but not a value -like what happens for grpc or http in the previous example.

This is not supported by other tools like kustomize but, if you are an OpenTelemetry Collector user, you would expect to be able to use the same kind of config that you would use with the regular Go binary.

The operator shows a warning in this situation.

The fix

The fix is based on this blog article.

I was implementing the MarshalJSON for the OpenTelemetry, OpenTelemetrySpec and Config structs to ensure the custom MarshalJSON method from AnyConfig would be called -because it seems in some scenarios the custom MarshalJSON implementations are not called when one struct is part of another, I don't get why-. I was forcing the call to the custom MarshalJSON only for receivers because are the only kind of components where this situation can be reproduced. Why? Because receiver components are the only ones that contain configurations where keys without values are valid.

I added a unit test to ensure this situation, which seems to be reproducible only when the object is created + retrieved in Kubernetes, didn't reproduce again.

The unexpected workaround

With #3333, all those keys that would not have any value, have now a value by default. So, the issue is there but users will not suffer it anymore.

@swiatekm
Copy link
Contributor

@iblancasa I think I finally understand, thanks for being so patient with us! Could we actually keep the tests you've written, and skip them for the time being? We've been talking about alternative solutions to this issue, and the tests would be very helpful as benchmarks. I'd also hate for your hard work to go to waste this way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong marshal for OpenTelemetryCollector
4 participants