Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[🐛 Bug]: Relaytype node has memory leak? #13643

Closed
parholmdahl opened this issue Feb 29, 2024 · 8 comments
Closed

[🐛 Bug]: Relaytype node has memory leak? #13643

parholmdahl opened this issue Feb 29, 2024 · 8 comments

Comments

@parholmdahl
Copy link

parholmdahl commented Feb 29, 2024

What happened?

We are running a couple of nodes in kubernetes. A couple of months ago we also set up a relay node, to relay traffic to browserstack from our ordinary grid. After some time when we start to examin what the pods are doing, we see some strange behaviour on memory and cpu, that we don't se in an "ordinary" type of node, just this one that is relay type.

This is picture from CPU and Memory utilisation on docker;
image

This is on days when the relay node is not used at all.. The graph starts when we restarted the relay node, and from that day we have not run ANY tests on that node.

How can we reproduce the issue?

Building a custom image with this dockerfile: 

ARG NAMESPACE="selenium"
ARG VERSION="4.16"
ARG AUTHORS
FROM ${NAMESPACE}/node-base:${VERSION}
LABEL authors=${AUTHORS}

USER root

EXPOSE 4444

COPY config.toml /opt/selenium/config.toml

USER 1200




TOML FILE LIKE THIS:

[events]
publish = "tcp://URLTOHUB:4442"
subscribe = "tcp://URLTOHUB:4443"

[node]
session-timeout = "30"
override-max-sessions = false
detect-drivers = false
drain-after-session-count = 0
max-sessions = 1

[relay]
url = "https://USERNAME:[email protected]/wd/hub"
status-endpoint = "/status"

# Stereotypes supported by the service. The initial number is "max-sessions", and will allocate 
# that many test slots to that particular configuration
configs = [
  "3", "{\"networkname:applicationName\": \"BS\", \"browserName\": \"Chrome\", \"platformName\": \"iOS\"}",
  "3", "{\"networkname:applicationName\": \"BS\", \"browserName\": \"Chrome\", \"platformName\": \"Android\"}",
  "3", "{\"networkname:applicationName\": \"BS\", \"browserName\": \"MicrosoftEdge\", \"platformName\": \"iOS\"}",
  "3", "{\"networkname:applicationName\": \"BS\", \"browserName\": \"MicrosoftEdge\", \"platformName\": \"Android\"}",
  "3", "{\"networkname:applicationName\": \"BS\", \"browserName\": \"Safari\", \"platformName\": \"iOS\"}"
]

Relevant log output

No relevant logs presented either in relay node nor grid.

This is from relay node; 


2024-02-29 09:15:15,760 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing
2024-02-29T08:15:15.763075769Z 2024-02-29 09:15:15,762 INFO RPC interface 'supervisor' initialized
2024-02-29T08:15:15.763100698Z 2024-02-29 09:15:15,763 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-02-29T08:15:15.763286967Z 2024-02-29 09:15:15,763 INFO supervisord started with pid 8
2024-02-29T08:15:16.766555639Z 2024-02-29 09:15:16,766 INFO spawned: 'xvfb' with pid 9
2024-02-29T08:15:16.773162361Z 2024-02-29 09:15:16,768 INFO spawned: 'vnc' with pid 10
2024-02-29T08:15:16.773176607Z 2024-02-29 09:15:16,770 INFO spawned: 'novnc' with pid 11
2024-02-29T08:15:16.773180270Z 2024-02-29 09:15:16,772 INFO spawned: 'selenium-node' with pid 16
2024-02-29T08:15:16.775035859Z 2024-02-29 09:15:16,774 INFO success: selenium-node entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Appending Selenium options: --session-timeout 30
2024-02-29T08:15:17.118456195Z Tracing is disabled
2024-02-29T08:15:17.118458761Z Selenium Grid Node configuration: 
[events]
2024-02-29T08:15:17.118981784Z publish = "tcp://URL:4442"
2024-02-29T08:15:17.118985213Z subscribe = "tcp://URL:4443"
2024-02-29T08:15:17.118987847Z 
2024-02-29T08:15:17.118990377Z [node]
2024-02-29T08:15:17.118992760Z session-timeout = "30"
2024-02-29T08:15:17.118997516Z override-max-sessions = false
2024-02-29T08:15:17.119012637Z detect-drivers = false
2024-02-29T08:15:17.119014793Z drain-after-session-count = 0
2024-02-29T08:15:17.119016780Z max-sessions = 1
2024-02-29T08:15:17.119018572Z 
2024-02-29T08:15:17.119020568Z [relay]
2024-02-29T08:15:17.119023857Z # Default Appium/Cloud server endpoint
2024-02-29T08:15:17.119027710Z url = "https://USERNAME:[email protected]/wd/hub"
2024-02-29T08:15:17.119029925Z status-endpoint = "/status"
2024-02-29T08:15:17.119031954Z 
2024-02-29T08:15:17.119033870Z # Stereotypes supported by the service. The initial number is "max-sessions", and will allocate 
2024-02-29T08:15:17.119036574Z # that many test slots to that particular configuration
2024-02-29T08:15:17.119038500Z configs = [
2024-02-29T08:15:17.119041664Z   "3", "{\"networkname:applicationName\": \"BS\", \"browserName\": \"Chrome\", \"platformName\": \"iOS\"}",
  "3", "{\"networkname:applicationName\": \"BS\", \"browserName\": \"Chrome\", \"platformName\": \"Android\"}",
2024-02-29T08:15:17.119046364Z   "3", "{\"networkname:applicationName\": \"BS\", \"browserName\": \"MicrosoftEdge\", \"platformName\": \"iOS\"}",
2024-02-29T08:15:17.119048727Z   "3", "{\"networkname:applicationName\": \"BS\", \"browserName\": \"MicrosoftEdge\", \"platformName\": \"Android\"}",
  "3", "{\"networkname:applicationName\": \"BS\", \"browserName\": \"Safari\", \"platformName\": \"iOS\"}"
]Starting Selenium Grid Node...
2024-02-29T08:15:17.511386894Z 09:15:17.509 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
2024-02-29T08:15:17.515036786Z 09:15:17.514 INFO [OpenTelemetryTracer.createTracer] - Using OpenTelemetry for tracing
2024-02-29T08:15:17.650097389Z 09:15:17.649 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://URL:4442 and tcp://URL:4443
2024-02-29T08:15:17.723939368Z 09:15:17.723 INFO [UnboundZmqEventBus.<init>] - Sockets created
2024-02-29T08:15:17.784372513Z 2024-02-29 09:15:17,784 INFO success: xvfb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-02-29T08:15:17.784403030Z 2024-02-29 09:15:17,784 INFO success: vnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-02-29T08:15:17.784406206Z 2024-02-29 09:15:17,784 INFO success: novnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-02-29T08:15:18.726491490Z 09:15:18.725 INFO [UnboundZmqEventBus.<init>] - Event bus ready
2024-02-29T08:15:18.849008637Z 09:15:18.848 INFO [NodeServer.createHandlers] - Reporting self as: http://10.57.1.146:5555
2024-02-29T08:15:18.866471928Z 09:15:18.865 INFO [NodeOptions.getSessionFactories] - Detected 2 available processors
2024-02-29T08:15:18.886347536Z 09:15:18.885 INFO [RelayOptions.getSessionFactories] - Adding relay configs for https://USERNAME:[email protected]/wd/hub
2024-02-29T08:15:18.894427938Z 09:15:18.893 INFO [RelayOptions.lambda$getSessionFactories$1] - Mapping Capabilities {browserName: Chrome, networkname:applicationName: BS, platformName: IOS}, 3 times
2024-02-29T08:15:18.895131174Z 09:15:18.894 INFO [RelayOptions.lambda$getSessionFactories$1] - Mapping Capabilities {browserName: Chrome, networkname:applicationName: BS, platformName: ANDROID}, 3 times
2024-02-29T08:15:18.896039858Z 09:15:18.895 INFO [Rela
yOptions.lambda$getSessionFactories$1] - Mapping Capabilities {browserName: MicrosoftEdge, networkname:applicationName: BS, platformName: IOS}, 3 times
2024-02-29T08:15:18.896699533Z 09:15:18.896 INFO [RelayOptions.lambda$getSessionFactories$1] - Mapping Capabilities {browserName: MicrosoftEdge, networkname:applicationName: BS, platformName: ANDROID}, 3 times
2024-02-29T08:15:18.897365315Z 09:15:18.897 INFO [RelayOptions.lambda$getSessionFactories$1] - Mapping Capabilities {browserName: Safari, networkname:applicationName: BS, platformName: IOS}, 3 times
2024-02-29T08:15:18.972329901Z 09:15:18.971 INFO [Node.<init>] - Binding additional locator mechanisms: relative
2024-02-29T08:15:19.129039778Z 09:15:19.128 INFO [NodeServer$1.start] - Starting registration process for Node http://10.57.1.146:5555
2024-02-29T08:15:19.136872103Z 09:15:19.136 INFO [NodeServer.execute] - Started Selenium node 4.17.0 (revision e52b1be057*): http://10.57.1.146:5555
2024-02-29T08:15:19.902674483Z 09:15:19.896 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
2024-02-29T08:15:20.225305638Z 09:15:20.223 INFO [NodeServer.lambda$createHandlers$2] - Node has been added

Operating System

Kubernetes v1.27.10+k3s2

Selenium version

4.17

What are the browser(s) and version(s) where you see this issue?

NA

What are the browser driver(s) and version(s) where you see this issue?

NA

Are you using Selenium Grid?

4.17

Copy link

@parholmdahl, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@joerg1985
Copy link
Member

I can not see a leak here, the memory consumption is not growing infinite.
Everytime the full GC is running it does shrink to the same size again, i added a green line in the image to visualize this:
image

Or did i miss something?

@parhedberg
Copy link

The memory usage goes up to 3+Gb on an idle node.. Isn't that strange?

Our "regular nodes" does not have this sawtooth pattern on memory and cpu usage, not when they are used nor when they are idle..

@joerg1985
Copy link
Member

@parhedberg i think the regular node had this before b22d08d

Some of these optimizations could be applied to the RelayOptions too, e.g. using the same ImmutableCapabilities in all Slots to hit the fast path in .equals, what is done all the time by the NewSessionRunnable.

Regarding the 3+GB in idle, i think a full GC happens when a certain percentage of memory consumption is reached.
Before this the GC might be lazy and will not instantly clean things ...

@joerg1985
Copy link
Member

@parhedberg I found a httpclient which was not explicit closed, this might be cause for the slow release of memory.
Could you please retest the issue with one of the next nightly builds?

@parholmdahl
Copy link
Author

Redeployed our relay node with the nightly build now, so i will check the status in a day or two!

(and yes, i see now that i use both my old and my new account in this thread.. Sorry for that)...

@parholmdahl
Copy link
Author

After 24 hours, memory and cpu usage looks more natural for the Relay node, as our ordinary nodes do.

Great work!

Copy link

github-actions bot commented Apr 8, 2024

This issue has been automatically locked since there has not been any recent activity since it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 8, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants