Examples: Fix the interactive test for MacOS users #4779
Signed-off-by: Matej Gera <[email protected]>
I encountered this error (after running the interactive test previously, i.e. …):
=== RUN TestReadOnlyThanosSetup
23:57:38 Starting cadvisor
23:57:46 Ports for container interactive-cadvisor >> Local ports: map[http:8080] Ports available from host: map[http:59667]
23:57:46 Starting monitoring
23:57:49 cadvisor: W1014 18:27:49.933149 1 sysinfo.go:203] Nodes topology is not available, providing CPU topology
23:57:49 cadvisor: W1014 18:27:49.933514 1 sysfs.go:348] unable to read /sys/devices/system/cpu/cpu0/online: open /sys/devices/system/cpu/cpu0/online: no such file or directory
23:57:49 cadvisor: W1014 18:27:50.014122 1 oomparser.go:173] error reading /dev/kmsg: read /dev/kmsg: broken pipe
23:57:49 cadvisor: E1014 18:27:50.014232 1 oomparser.go:149] exiting analyzeLines. OOM events will not be reported.
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:388 msg="No time or size retention was set so using the default time retention" duration=15d
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:426 msg="Starting Prometheus" version="(version=2.27.0, branch=HEAD, revision=24c9b61221f7006e87cd62b9fe2901d43e19ed53)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:431 build_context="(go=go1.16.4, user=root@f27daa3b3fec, date=20210512-18:04:51)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:432 host_details="(Linux 5.10.25-linuxkit #1 SMP Tue Mar 23 09:27:39 UTC 2021 x86_64 monitoring (none))"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.176Z caller=main.go:433 fd_limits="(soft=1048576, hard=1048576)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.177Z caller=main.go:434 vm_limits="(soft=unlimited, hard=unlimited)"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.303Z caller=web.go:540 component=web msg="Start listening for connections" address=:9090
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.316Z caller=main.go:803 msg="Starting TSDB ..."
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.376Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.488Z caller=head.go:741 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.489Z caller=head.go:755 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=45.5µs
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.491Z caller=head.go:761 component=tsdb msg="Replaying WAL, this may take a while"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.506Z caller=head.go:813 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.506Z caller=head.go:818 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=9.6788ms wal_replay_duration=5.0222ms total_replay_duration=17.7563ms
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.548Z caller=main.go:828 fs_type=65735546
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.549Z caller=main.go:831 msg="TSDB started"
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.549Z caller=main.go:957 msg="Loading configuration file" filename=/shared/data/monitoring/prometheus.yml
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.581Z caller=main.go:988 msg="Completed loading of configuration file" filename=/shared/data/monitoring/prometheus.yml totalDuration=26.8292ms remote_storage=79.4µs web_handler=18.8µs query_engine=17.4µs scrape=4.1701ms scrape_sd=360.6µs notify=14.5µs notify_sd=324.1µs rules=16.3µs
23:57:52 monitoring: level=info ts=2021-10-14T18:27:52.581Z caller=main.go:775 msg="Server is ready to receive web requests."
23:57:53 monitoring: level=info ts=2021-10-14T18:27:53.786Z caller=main.go:957 msg="Loading configuration file" filename=/shared/data/monitoring/prometheus.yml
23:57:53 monitoring: level=info ts=2021-10-14T18:27:53.818Z caller=main.go:988 msg="Completed loading of configuration file" filename=/shared/data/monitoring/prometheus.yml totalDuration=32.7781ms remote_storage=16.1µs web_handler=12.1µs query_engine=13.5µs scrape=189.8µs scrape_sd=300.8µs notify=13.8µs notify_sd=15.2µs rules=13.9µs
23:57:54 Ports for container interactive-monitoring >> Local ports: map[http:9090] Ports available from host: map[http:59705]
interactive_test.go:107: interactive_test.go:107:
unexpected error: Prometheus failed to scrape local endpoint after 2 minutes, check monitoring Prometheus logs
23:59:54 Killing monitoring
23:59:56 Killing cadvisor
--- FAIL: TestReadOnlyThanosSetup (154.81s)
FAIL
FAIL command-line-arguments 156.287s
FAIL
Also, the logs were the same as in the comment after starting afresh (without existing …
Oof, I think I know this one 🤕. Since we're running the Prometheus instance in Docker, but we want to scrape metrics from the host machine, there needs to be a connection from the container to the host. The framework assumes that the host machine will be reachable on the network's gateway IP, but this does not seem to work in all cases. Since the host metrics never get scraped, the operation times out after 2 minutes.
Oh! I think this is due to Docker networking being different on macOS. An easy fix seems to be just replacing this line with … Maybe I can raise a PR to …
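For illustration, a minimal Go sketch of choosing the container-to-host address per platform. It assumes that Docker Desktop on macOS exposes the host as `host.docker.internal`, while on Linux the bridge gateway IP reaches the host. `hostAddress` is a hypothetical helper and not necessarily the replacement line referred to above.

```go
package main

import (
	"fmt"
	"runtime"
)

// hostAddress returns the address a container should use to reach the host.
// Assumption: Docker Desktop on macOS resolves host.docker.internal to the host,
// whereas on Linux the Docker bridge gateway IP is reachable directly.
func hostAddress(goos, gatewayIP string) string {
	if goos == "darwin" {
		return "host.docker.internal"
	}
	return gatewayIP
}

func main() {
	// 172.17.0.1 is the usual default bridge gateway; treat it as an example value.
	fmt.Println(hostAddress(runtime.GOOS, "172.17.0.1"))
}
```

Passing `goos` as a parameter rather than reading `runtime.GOOS` inside the function keeps the logic easy to test on any machine.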
Absolutely, go for it! 🥳 I'd be happy to review it.
For the … Seems like … Changing the block plan profile from … Edit: Also, the status code 137 probably means it's getting OOM killed in some way, even though docker inspect shows it wasn't.
Maybe this is an issue with thanosbench, or am I doing something wrong? 🤕
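On Linux and macOS, a process killed by signal N conventionally reports exit status 128+N, so 137 means the process received SIGKILL (signal 9). That signal is what the OOM killer sends, but the status alone does not say who sent it. A minimal Go sketch of decoding such statuses; the function names are hypothetical.

```go
package main

import "fmt"

// signalFromExitCode decodes the shell convention where a process terminated
// by signal N reports exit status 128+N. Exit statuses range from 0 to 255.
func signalFromExitCode(code int) (sig int, killed bool) {
	if code > 128 && code <= 255 {
		return code - 128, true
	}
	return 0, false
}

// describeExitCode turns an exit status into a human-readable explanation.
func describeExitCode(code int) string {
	sig, ok := signalFromExitCode(code)
	switch {
	case !ok:
		return fmt.Sprintf("exited with status %d", code)
	case sig == 9:
		// SIGKILL: sent by the OOM killer, but also by a forced stop/kill.
		return "killed by SIGKILL (signal 9)"
	default:
		return fmt.Sprintf("killed by signal %d", sig)
	}
}

func main() {
	for _, c := range []int{0, 1, 137, 143} {
		fmt.Println(c, describeExitCode(c))
	}
}
```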
Works fine on my machine ™️. I'm wondering if you're right about the memory, especially if you're running on MacOS or in another virtualized environment with a memory limit (on my Linux machine, I seem to be limited only by my host machine's available memory). What does …
Yes, I considered that. Docker for macOS limits containers to 2GB of memory and half the number of host CPUs by default, so I bumped memory to 8GB and CPUs to the maximum, but I still got the same result. The test creates two containers for each store (one for block plan and one for block gen). The block gen container …
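To confirm which memory limit a container actually sees (as opposed to the Docker Desktop setting), one option is to read the cgroup limit file from inside the container; `docker stats` also shows a per-container limit. A minimal Go sketch, assuming the usual cgroup v2 path (`/sys/fs/cgroup/memory.max`) with a cgroup v1 fallback; `parseMemoryLimit` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryLimit interprets the contents of a cgroup memory limit file.
// cgroup v2 writes "max" when there is no limit; otherwise it is a byte count.
func parseMemoryLimit(raw string) (limit uint64, unlimited bool, err error) {
	s := strings.TrimSpace(raw)
	if s == "max" {
		return 0, true, nil
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("parse %q: %w", s, err)
	}
	return n, false, nil
}

func main() {
	// Assumed paths: cgroup v2 first, then cgroup v1. Run this inside the container.
	for _, path := range []string{
		"/sys/fs/cgroup/memory.max",
		"/sys/fs/cgroup/memory/memory.limit_in_bytes",
	} {
		data, err := os.ReadFile(path)
		if err != nil {
			continue // file not present on this cgroup version
		}
		n, unlimited, err := parseMemoryLimit(string(data))
		switch {
		case err != nil:
			fmt.Println(path, "error:", err)
		case unlimited:
			fmt.Println(path, "no limit")
		default:
			fmt.Printf("%s: %.2f GiB\n", path, float64(n)/(1<<30))
		}
		return
	}
	fmt.Println("no cgroup memory limit file found")
}
```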
The … I think for now the workaround for the … Is there any reason why we do it in Docker? Maybe we could avoid containers for this and just grab the latest release binary of … Also, another issue I came across is that …
And it seems I can't pull the image tag used in interactive_test for Thanos, i.e. … Everything else seems to work! 💪🏼
Hm, but would this actually mitigate anything? This seems to be related to the user's setup, i.e. either their machine does not have enough RAM overall or, if they are using Docker on Mac, they need to bump up their memory limit. We would lose the flexibility of having this in Docker and would need to deal with binaries (for both Linux and MacOS, on top of that).
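To make the maintenance cost concrete: with binaries, every OS/architecture pair contributors use needs its own artifact. A minimal Go sketch; the `<name>-<version>.<os>-<arch>.tar.gz` naming scheme and the `0.0.0` version are hypothetical placeholders, not thanosbench's actual release layout.

```go
package main

import (
	"fmt"
	"runtime"
)

// releaseAsset builds a release archive name for one OS/arch pair.
// The naming scheme is a hypothetical example used only for illustration.
func releaseAsset(name, version, goos, goarch string) string {
	return fmt.Sprintf("%s-%s.%s-%s.tar.gz", name, version, goos, goarch)
}

func main() {
	// Each platform needs its own download, e.g. Linux and both Mac architectures.
	for _, p := range [][2]string{{"linux", "amd64"}, {"darwin", "amd64"}, {"darwin", "arm64"}} {
		fmt.Println(releaseAsset("thanosbench", "0.0.0", p[0], p[1]))
	}
	fmt.Println("this machine:", runtime.GOOS+"/"+runtime.GOARCH)
}
```

A single Linux container image sidesteps this matrix, which is the flexibility the comment above refers to.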
Thanks for this! I adjusted the command.
I think we should specify in the docs that you need to first run …
Yes, I think this would be better too! Or we can leave the choice of profile up to the users (via a const which can be substituted into the docker commands, default being …
Thanks!
Oh, I was unaware! But yes, mentioning this as a step would be great! 🙂
I made the changes and updated the docs to make them clearer as well. I chose … cc @saswatamcode @yeya24, PTAL!
efficientgo/e2e to fix the interactive test for MacOS users
Thank you!! It looks awesome now! 💫
@yeya24 when you get a moment, may we get a review / merge here if it looks OK to you? Thanks 😊
Thanks! Working great for me.
Changes
This PR includes a couple of fixes, targeted towards MacOS users, namely:
- efficientgo/e2e includes a fix which will allow running the interactive test on MacOS
- `cp -t` usage in the code, which was not compatible with the MacOS version of `cp`
Verification
Perhaps an Apple user could give the final confirmation? 🍏