Remote Configuration Capability of Supervisor is not restarting my collector if the configuration of the collector is changed #32959

Closed
MSA0208 opened this issue May 9, 2024 · 35 comments
Labels
bug Something isn't working cmd/opampsupervisor Stale

Comments

@MSA0208

MSA0208 commented May 9, 2024

Component(s)

No response

Describe the issue you're reporting

Hi Team,

I have connected my OpAMP server and the OpAMP Supervisor, which runs my Collector executable. It runs fine with the supervisor.yaml below:
server:
  endpoint: ws://127.0.0.1:4320/v1/opamp
agent:
  executable: /root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/OutputBinaries/NGxConnector
  args: --config /root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/config.yaml

I then added the capability for the Supervisor to accept remote configuration:
server:
  endpoint: ws://127.0.0.1:4320/v1/opamp
capabilities:
  AcceptsRemoteConfig: true
agent:
  executable: /root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/OutputBinaries/NGxConnector
  args: --config /root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/config.yaml

After making this change in my supervisor.yaml, the Supervisor still starts and runs my executable without problems.

My actual problem is that when I change the config.yaml of the Collector pipeline, the change is not reflected on the Supervisor or the agent side. Please help me get remote configuration working.

I am using the latest main of the OpenTelemetry Collector and of opamp-go, along with the extension from my collector-contrib main checkout.

@MSA0208 MSA0208 added the needs triage New item requiring triage label May 9, 2024
@crobert-1 crobert-1 added bug Something isn't working cmd/opampsupervisor labels May 9, 2024
Contributor

github-actions bot commented May 9, 2024

Pinging code owners for cmd/opampsupervisor: @evan-bradley @atoulme @tigrannajaryan. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@evan-bradley
Contributor

Hi, @MSA0208. When you say you are changing the config.yaml file, do you mean this one?

/root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/config.yaml

The Supervisor will only restart the Collector when it receives new configuration from the OpAMP server; changes to files on disk will not restart the Collector. If you are changing the code in the OpAMP server, do you see any logs in the Supervisor about receiving new config?

@evan-bradley evan-bradley removed the needs triage New item requiring triage label May 9, 2024
@open-telemetry open-telemetry deleted a comment from MSA0208 May 10, 2024
@evan-bradley
Contributor

@MSA0208 I deleted your comment because I noticed there were some credentials in there. I would suggest you rotate the tokens and change the passwords used in your config.

@evan-bradley
Contributor

> Hi @evan-bradley,
> Thank you so much for the reply.
>
> I want to know what kind of changes, and to which file, will result in a restart of the Collector.
> Yes, I have added logs to the OpAMP server code and also to the Supervisor code from this repository:
> https://github.com/open-telemetry/opamp-go/tree/main/internal/examples/supervisor
> I added logs throughout the methods and found that effective.yaml is the file that gets executed, along with the args passed. I tried changing effective.yaml manually, but that still didn't work.
> and supervisor/bin folder has [...]
> I have placed the actual config.yaml required for the Collector in the same folder for testing purposes and tried modifying it, but that didn't restart my Collector.
>
> Please let me know the exact steps to follow, and which dynamic changes will restart the Collector.

Thanks for the details. The only file you should modify directly is the Supervisor's configuration file. When using the Supervisor, all Collector configuration updates should be made through the OpAMP server, which will send them to the Supervisor and restart the Collector with the new config. The effective.yaml file should not be directly edited, it's only intended to be created/updated by the Supervisor.
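
For reference, the example configs in the contrib repository's cmd/opampsupervisor directory spell capability keys in snake_case under a top-level `capabilities` block. The following is a minimal sketch along those lines, not a verified fix for this report: the endpoint and executable path are copied from the issue, and the `storage` directory is a hypothetical location.

```yaml
# supervisor.yaml (sketch; assumes the contrib Supervisor, cmd/opampsupervisor)
server:
  endpoint: ws://127.0.0.1:4320/v1/opamp

capabilities:
  # Lets the Supervisor accept Collector config pushed by the OpAMP server.
  accepts_remote_config: true
  # Reports the config actually in use back to the server.
  reports_effective_config: true

agent:
  executable: /root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/OutputBinaries/NGxConnector

storage:
  # Hypothetical directory where the Supervisor persists its own state.
  directory: /tmp/opampsupervisor
```

With a config of this shape, Collector pipeline changes would be made on the OpAMP server side and pushed down to the Supervisor, rather than applied by editing files on disk.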

@MSA0208
Author

MSA0208 commented May 10, 2024

Hi @evan-bradley,

Thank you, let me try this and get back to you.
Sorry, I forgot to mask or remove my credentials from the config.yaml.

@tigrannajaryan
Member

tigrannajaryan commented May 10, 2024

The effective.yaml file should not be directly edited, it's only intended to be created/updated by the Supervisor.

To avoid future user confusion, should we prepend the effective.yaml file with a comment saying that it is autogenerated, is not meant to be user-editable, and will be overwritten by the Supervisor?

@evan-bradley
Contributor

I was thinking the same thing. We should clearly indicate which files are not intended to be modified by the user.

@MSA0208
Author

MSA0208 commented May 15, 2024

Hi @evan-bradley,

I am now using the contrib Supervisor, https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/cmd/opampsupervisor, instead of the Supervisor from opamp-go, because it has the additional remote configuration capabilities described in the documentation, which I need for my use case. I followed the documentation exactly and the server starts, but I get the error below:
*configFlag supervisor.yaml
Config Loaded supervisor.yaml
2024-05-15T02:44:15.515-0700 DEBUG commander/commander.go:74 Starting agent {"agent": "/root/OTEL98/opentelemetry-collector-contrib-main/cmd/otelcontribcol/OutputBinaries/NGxConnector"}
2024-05-15T02:44:15.516-0700 DEBUG commander/commander.go:93 Agent process started {"pid": 60790}
2024-05-15T02:44:18.518-0700 ERROR opampsupervisor/main.go:26 could not get bootstrap info from the Collector: collector's OpAMP client never connected to the Supervisor
main.main
/root/OTEL98/opentelemetry-collector-contrib-main/cmd/opampsupervisor/main.go:26
runtime.main
/usr/local/go/src/runtime/proc.go:271

Can you please help me with what needs to be configured to solve this issue?

@MSA0208
Author

MSA0208 commented May 16, 2024

Hi everyone,

I am still waiting for a response on this.

@MSA0208
Author

MSA0208 commented May 27, 2024

I was able to solve the above issue.

The issue I am facing now is that the agent is not healthy. I started my Supervisor on a random port and it starts my agent Collector, but the agent reports that it is unable to connect to the Supervisor, with the error "connection refused".

Below is the sample Collector config.yaml used for the agent Collector; I am using the same file as my bootstrap.yaml.

collector-config.yaml
extensions:
  opamp:
    instance_uid: 01HYAH3BNC06AFVGQT5ZYC0GEK
    server:
      ws:
        endpoint: ws://127.0.0.1:4322/v1/opamp
  health_check:
    endpoint: "localhost:4444"
    #tls:
    #  ca_file: "/path/to/ca.crt"
    #  cert_file: "/path/to/cert.crt"
    #  key_file: "/path/to/key.key"
    path: "/health/status"
    check_collector_pipeline:
      enabled: true
      interval: "5m"
      exporter_failure_threshold: 5

Let me know what else could be causing the issue, or point me to the fix that resolved this agent health problem. Also, after receiving my remote config, the Supervisor is unable to restart my agent Collector; I think this could be because of the connection issue.

opamp-extension/agent log:

2024-05-27T05:59:37.069-0700 error [email protected]/opamp_agent.go:72 Failed to connect to the OpAMP server {"kind": "extension", "name": "opamp", "error": "dial tcp 127.0.0.1:4322: connect: connection refused"}
github.com/open-telemetry/opentelemetry-collector-contrib/extension/opampextension.(*opampAgent).Start.func2
github.com/open-telemetry/opentelemetry-collector-contrib/extension/[email protected]/opamp_agent.go:72
github.com/open-telemetry/opamp-go/client/types.CallbacksStruct.OnConnectFailed
github.com/open-telemetry/[email protected]/client/types/callbacks.go:149
github.com/open-telemetry/opamp-go/client.(*wsClient).tryConnectOnce
github.com/open-telemetry/[email protected]/client/wsclient.go:153
github.com/open-telemetry/opamp-go/client.(*wsClient).ensureConnected
github.com/open-telemetry/[email protected]/client/wsclient.go:217
github.com/open-telemetry/opamp-go/client.(*wsClient).runOneCycle
github.com/open-telemetry/[email protected]/client/wsclient.go:261
github.com/open-telemetry/opamp-go/client.(*wsClient).runUntilStopped
github.com/open-telemetry/[email protected]/client/wsclient.go:346
github.com/open-telemetry/opamp-go/client/internal.(*ClientCommon).StartConnectAndRun.func1
github.com/open-telemetry/[email protected]/client/internal/clientcommon.go:202
2024-05-27T05:59:37.069-0700 error [email protected]/logger.go:26

Supervisor logs:

Response from HealthChecker: &{404 Not Found 404 HTTP/1.1 1 1 map[Content-Length:[19] Content-Type:[text/plain; charset=utf-8] Date:[Mon, 27 May 2024 12:51:43 GMT] X-Content-Type-Options:[nosniff]] 0xc000040120 19 [] false false map[] 0xc0002165a0 }
health check on %s returned %d
http://localhost:4444/
404
2024-05-27T05:51:43.834-0700 ERROR supervisor/supervisor.go:884 Agent is not healthy {"error": "health check on http://localhost:4444/ returned 404"}
github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).healthCheck
/root/OTEL98/opentelemetry-collector-contrib-main/cmd/opampsupervisor/supervisor/supervisor.go:884
github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).runAgentProcess
/root/OTEL98/opentelemetry-collector-contrib-main/cmd/opampsupervisor/supervisor/supervisor.go:955
github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.NewSupervisor.func1
/root/OTEL98/opentelemetry-collector-contrib-main/cmd/opampsupervisor/supervisor/supervisor.go:207
Inside SetHealth from clientCommon.go!!!: start_time_unix_nano:1716810290622218705 last_error:"health check on http://localhost:4444/ returned 404"
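
One plausible reading of the 404 above is a path mismatch: the Supervisor log shows the probe going to http://localhost:4444/, while the health_check extension in the posted config serves its status at /health/status. The following is a minimal sketch of a config whose path matches the probed URL, assuming the probe targets the root path as logged (the health_check extension's default `path` is `/`).

```yaml
extensions:
  health_check:
    endpoint: "localhost:4444"
    # Serve health at the root path, matching the URL the Supervisor probed
    # in the log above. Omitting `path` has the same effect, since "/" is
    # the extension's default.
    path: "/"
```

This is untested against the setup in this thread. Whether the Supervisor honours or overrides a user-supplied health_check block may depend on the Supervisor version.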

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jul 29, 2024
@Asarew

Asarew commented Jul 30, 2024

@MSA0208 How did you solve the issue of:

could not get bootstrap info from the Collector: collector's OpAMP client never connected to the Supervisor

@MSA0208
Author

MSA0208 commented Jul 31, 2024

hi @Asarew ,

The Collector was not starting with the nop components, so I provided the actual service configuration with the opamp extension configured.

@github-actions github-actions bot removed the Stale label Jul 31, 2024
@cforce

cforce commented Jul 31, 2024

"Collector was not getting started with nop, "
Sounds like a feature, not a bug.
Why should it start if there is nothing to do?

@MSA0208
Author

MSA0208 commented Jul 31, 2024

Related issue: [cmd/opampsupervisor] Use nop components during bootstrapping #32554

@Asarew

Asarew commented Jul 31, 2024

My issue was that I didn't build the nop receiver and exporter into the Collector.
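
The nop receiver and exporter mentioned here are ordinary Collector components, so a custom distribution has to include them at build time. The following is a sketch of an OpenTelemetry Collector Builder (ocb) manifest fragment that does so, assuming the Collector is built with ocb. The version numbers are placeholders, not the versions used in this thread, and should match the Collector version being built.

```yaml
# builder-config.yaml fragment (sketch). v0.100.0 is a placeholder version.
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/nopreceiver v0.100.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/nopexporter v0.100.0
extensions:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/opampextension v0.100.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckextension v0.100.0
```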

@MSA0208
Author

MSA0208 commented Jul 31, 2024

I hope your issue is solved now.

The issue I am facing now is this: I configured a random port for the Supervisor and started my Collector with the Supervisor and the OpAMP server. The server is able to communicate the remote changes to the Supervisor, but the Supervisor is not passing the remote config on to my actual Collector.

@Asarew

Asarew commented Jul 31, 2024

As far as I know, the Supervisor starts on a random port just for the bootstrap communication. After that, there is no communication between the Supervisor and the Collector except for restarts.

@MSA0208
Author

MSA0208 commented Jul 31, 2024

OK, so you mean that we can't send the remote config received from the OpAMP server to our Collector client using the Supervisor?

If that is the case, how do we send the remote config received at the Supervisor to the Collector?

@Asarew

Asarew commented Jul 31, 2024

The Supervisor writes the configuration to disk and then restarts the Collector.

@MSA0208
Author

MSA0208 commented Jul 31, 2024

Yeah, that will be the effective.yaml file.
But when I use that effective.yaml, I continuously observe restarts on the client side, which is my Collector; it is always in a restarting phase.

@Asarew

Asarew commented Jul 31, 2024

Hmm, maybe check the agent.log file. I'm afraid I don't have a specific answer to your issue 😢

@MSA0208
Author

MSA0208 commented Jul 31, 2024

Thanks for pointing me to agent.log. I found the issue; I have yet to solve it, but will do :)

@Asarew

Asarew commented Jul 31, 2024

@MSA0208 You're welcome, good luck 👍🏾

@MSA0208
Author

MSA0208 commented Jul 31, 2024

I was able to solve the issue, and my OpAMP setup is now working fine for remote configurations.

I also tried removing a few things from the pipeline; I think that would cause an error in the Collector.

For example: I have my log and metric pipelines configured and I want to remove the metric pipeline, but the removal is not applied.

@MSA0208
Author

MSA0208 commented Aug 2, 2024

@Asarew
@cforce

Have you ever tried updating the existing config.yaml through OpAMP remote configuration? Does that work?

I ask because the API on the web console says "Additional configurations".

Which kinds of changes to the existing config.yaml will be applied: update, add, delete?

@Asarew

Asarew commented Aug 2, 2024

I can let you know at the beginning of next week. I'm still developing the otel controller and haven't gotten to updates yet, just the initial config push.

@MSA0208
Author

MSA0208 commented Aug 2, 2024

@Asarew
Sure, thank you.
Until then I will try all the possible kinds of remote changes and observe the behaviour.

@Asarew

Asarew commented Aug 7, 2024

Took me a while to fix the controller, but now I can push new changes from the controller down to the Supervisor, which in turn saves the config to disk and restarts the Collector. So for me everything seems to be working fine.

@cforce

cforce commented Aug 8, 2024

What was fixed? I don't see any attached pull requests

@MSA0208
Author

MSA0208 commented Aug 20, 2024

@cforce There was nothing much to fix in OpAMP, so no pull request was required. We just have to make sure the Collector is given the config.yaml it needs to run.

@Asarew Have you ever verified this with HTTPS using TLS certs? What kind of certs should be used here? Any idea which certs to use for HTTPS communication between these 3 modules?

@MSA0208
Author

MSA0208 commented Aug 26, 2024

@Asarew I used self-signed certs generated with OpenSSL, but I get the following error:
Failed to connect to the OpAMP server {"kind": "extension", "name": "opamp", "error": "tls: first record does not look like a TLS handshake"
Connection failed (tls: first record does not look like a TLS handshake), will retry. {"kind": "extension", "name": "opamp", "client": "ws"}
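
The error text "first record does not look like a TLS handshake" comes from Go's TLS client when the peer's first bytes are not a TLS record, which typically happens when a TLS client connects to a plaintext listener. The following is a sketch of one arrangement to check, assuming TLS is intended on the Supervisor-to-server leg. The certificate path is hypothetical, and the OpAMP server itself must be listening with TLS on that port.

```yaml
# supervisor.yaml fragment (sketch)
server:
  # wss:// makes the client speak TLS; the listener on 4320 must do the same.
  endpoint: wss://127.0.0.1:4320/v1/opamp
  tls:
    # Hypothetical path to the CA that signed the server's self-signed cert.
    ca_file: /path/to/ca.crt
```

Conversely, an endpoint that only serves plaintext would be addressed with ws://, as in the configs earlier in this thread.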


@github-actions github-actions bot added the Stale label Oct 29, 2024
@cforce

cforce commented Oct 29, 2024

Is this issue still valid? It seems that the original report by @MSA0208 has been resolved, and now a separate topic about TLS configuration has been raised by @Asarew. If this new topic is still relevant, it might be best to create a dedicated issue for it. Just to clarify, there’s nothing pending from my side.

@MSA0208
Author

MSA0208 commented Oct 29, 2024

Yes, the issue is solved. We can close it.
