Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Selenium Grid in case multiple scaler triggers are activate #6437

Merged
merged 6 commits into from
Dec 24, 2024

Conversation

VietND96
Copy link
Contributor

@VietND96 VietND96 commented Dec 23, 2024

Implement strictly the condition to trigger the scaler in complex cases that Selenium Grid includes autoscaling Nodes (with multiple scaledObject with different triggers metadata), non-autoscaling Nodes, relay Nodes, and so on.

In general, update the scaler logic should be aligned with Selenium Grid core in DefaultSlotMatcher - which can be inspected https://github.com/SeleniumHQ/selenium/blob/trunk/java/src/org/openqa/selenium/grid/data/DefaultSlotMatcher.java

It is also recommended to set the browserName, browserVersion, and platformName to align across request capabilities, Node stereotype and scaler trigger params to have correct scaling behavior.

In terms of SlotMatcher in Grid, there is no value latest for the capability browserVersion. So, remove that value latest and keep the default value EMPTY.
The default value of platformName is also updated to EMPTY. Let the user set the value that matches with platformName in Node stereotype and request capability that they want to scale by this ScaledObject/ScaledJob.

So, with the default scaler metadata (only browserName is mandatory to set)

  triggers:
    - type: selenium-grid
      metadata:
        url: 'http://selenium-hub:4444/graphql'
        browserName: 'chrome'

It will scale up Node for the request (in Python binding example)

options = ChromeOptions()
driver = webdriver.Remote(options=options, command_executor=SELENIUM_GRID_URL)

For example, the client sends the request with platfromName capability set

options = ChromeOptions()
options.set_capability('platformName', 'Linux')
driver = webdriver.Remote(options=options, command_executor=SELENIUM_GRID_URL)

It is equivalent to scaler trigger params settings.

  triggers:
    - type: selenium-grid
      metadata:
        url: 'http://selenium-hub:4444/graphql'
        browserName: 'chrome'
        platformName: 'linux'

For example, the client sends the request.

options = ChromeOptions()
options.set_capability('platformName', 'Linux')
options.set_capability('browserVersion', '131.0')
driver = webdriver.Remote(options=options, command_executor=SELENIUM_GRID_URL)

The scaler trigger param should be aligned as

  triggers:
    - type: selenium-grid
      metadata:
        url: 'http://selenium-hub:4444/graphql'
        browserName: 'chrome'
        platformName: 'linux'
        browserVersion: '131.0'

Most people are using the popular browser node image from the project docker-selenium. In default Node stereotype, the short browser version is set for browserVersion, it might cause breaking backward compatiblity, or request without browserVersion could not trigger scaler anymore. Via SeleniumHQ/docker-selenium#2520, you can set SE_NODE_BROWSER_VERSION as empty to unset the default version (take newer docker image tag from that project, which contains the change onwards).

This also avoids multiple scalers, for example same browserName combines multiple browserVersion could have issue with overlapping in counting number of ongoing and pending sessions belonging to each scaler trigger.

In terms of Selenium Grid v4, it has a kind of Relay Node, which has capabilities to forward commands to the Appium instance. With this Node type, the browserName is optional. So, updating the trigger param browserName to optional.

Checklist

  • Tests have been added
  • Changelog has been updated and is aligned with our changelog requirements
  • A PR is opened to update the documentation on (repo) (if applicable)
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #

Relates to #

@VietND96 VietND96 requested a review from a team as a code owner December 23, 2024 01:26
@VietND96 VietND96 force-pushed the main branch 3 times, most recently from 0153063 to 8615239 Compare December 23, 2024 06:07
…sion triggers are active

Signed-off-by: Viet Nguyen Duc <[email protected]>
@VietND96 VietND96 changed the title fix: Selenium Grid in case multiple browserVersion scalers are active fix: Selenium Grid in case multiple scaler triggers are activate Dec 23, 2024
Signed-off-by: Viet Nguyen Duc <[email protected]>
@VietND96 VietND96 mentioned this pull request Dec 23, 2024
24 tasks
Signed-off-by: Viet Nguyen Duc <[email protected]>
@JorTurFer
Copy link
Member

JorTurFer commented Dec 23, 2024

/run-e2e selenium
Update: You can check the progress here

@JorTurFer
Copy link
Member

JorTurFer commented Dec 23, 2024

/run-e2e selenium
Update: You can check the progress here

@VietND96
Copy link
Contributor Author

Can you try e2e run one more time, I guess pull image could take time or reach pull limit rate...Here is result in my local

EXECUTION SUMMARY
##############################################
##############################################
Passed tests:
        Execution of tests/scalers/selenium/selenium_test.go, has passed after "one" attempts

@JorTurFer
Copy link
Member

It looks like issue reaching the hub ->

scale_handler	error getting scale decision	***"scaledObject.Namespace": "selenium-test-ns", "scaledObject.Name": "chrome-selenium-test-so", "scaler": "seleniumGridScaler", "error": "error requesting selenium grid endpoint: Post \"[http://selenium-hub.selenium-test-ns:4444/graphql\](http://selenium-hub.selenium-test-ns:4444/graphql/)": dial tcp ***0.0.***5***.***4***:4444: connect: connection refused"***
github.com/***/keda/v***/pkg/scaling.(*scaleHandler).getScalerState
	/workspace/pkg/scaling/scale_handler.go:780
github.com/***/keda/v***/pkg/scaling.(*scaleHandler).getScaledObjectState.func***

I'm checking another issue in my local, once I finish I'll try to check this too

@VietND96
Copy link
Contributor Author

@JorTurFer, I also done other tests in downstream repos. Let's rerun the e2e job here to see any luck :)

@JorTurFer
Copy link
Member

JorTurFer commented Dec 24, 2024

/run-e2e selenium
Update: You can check the progress here

@JorTurFer
Copy link
Member

Now I don't see connection issues trying to reach the hub, but it doesn't pull the values from the hub. I see 6 items in the queue:
image

But KEDA doesn't scale out the workload and no error is shown either

@VietND96
Copy link
Contributor Author

Thanks for your info, I am checking it in local again now

@VietND96
Copy link
Contributor Author

The cluster can be accessed, right? how about when kubectl get deployment, can it list 3 deploy resources, since I updated it to use image selenium/node-chrome:nightly (similar for firefox, edge). Not sure the deployment can be created?

@JorTurFer
Copy link
Member

The cluster can be accessed, right? how about when kubectl get deployment, can it list 3 deploy resources, since I updated it to use image selenium/node-chrome:nightly (similar for firefox, edge). Not sure the deployment can be created?

let me check

@JorTurFer
Copy link
Member

Currently, I can see 6 items in the queue:
image

And I have tried to scale out manually the workloads to 1, and KEDA has scaled in to 0, so the current value calculated is 0. The deployments are correct

@JorTurFer
Copy link
Member

I've added a print, and this is the content of the Hub response's body (the missing 2 are most probably because when I started the pod manually, they were processed)

{
  "data": {
    "grid": {
      "sessionCount": 0,
      "maxSession": 0,
      "totalSlots": 0
    },
    "nodesInfo": {
      "nodes": [
      ]
    },
    "sessionsInfo": {
      "sessionQueueRequests": [
        "{\n  \"acceptInsecureCerts\": true,\n  \"browserName\": \"firefox\",\n  \"moz:firefoxOptions\": {\n    \"args\": [\n      \"-remote-debugging-port\",\n      \"40627\"\n    ]\n  }\n}",
        "{\n  \"acceptInsecureCerts\": true,\n  \"browserName\": \"chrome\"\n}",
        "{\n  \"acceptInsecureCerts\": true,\n  \"browserName\": \"firefox\",\n  \"moz:firefoxOptions\": {\n    \"args\": [\n      \"-remote-debugging-port\",\n      \"40947\"\n    ]\n  }\n}",
        "{\n  \"acceptInsecureCerts\": true,\n  \"browserName\": \"chrome\"\n}"
      ]
    }
  }
}

@JorTurFer
Copy link
Member

JorTurFer commented Dec 24, 2024

Now I'm debugging this function:

// This function checks if the request capabilities match the scaler metadata
func checkRequestCapabilitiesMatch(request Capability, browserName string, browserVersion string, _ string, platformName string) bool {
	// Check if browserName matches
	browserNameMatch := request.BrowserName == "" && browserName == "" ||
		strings.EqualFold(browserName, request.BrowserName)

	// Check if browserVersion matches
	browserVersionMatch := (request.BrowserVersion == "" && browserVersion == "") ||
		(request.BrowserVersion == "stable" && browserVersion == "") ||
		(strings.HasPrefix(browserVersion, request.BrowserVersion) && request.BrowserVersion != "" && browserVersion != "")

	// Check if platformName matches
	platformNameMatch := request.PlatformName == "" && platformName == "" ||
		strings.EqualFold(platformName, request.PlatformName)

	return browserNameMatch && browserVersionMatch && platformNameMatch
}

And the platformName doesn't match

	// Check if platformName matches
	platformNameMatch := request.PlatformName == "" && platformName == "" ||
		strings.EqualFold(platformName, request.PlatformName)

So the result of the function is false. The current value for request.PlatformName is "". As I understand, it means that the isn't any platform requirement, so the logic should be something like (and most probably, this is the same for the other 2 checks)

	// Check if platformName matches
	platformNameMatch := request.PlatformName == "" ||
		strings.EqualFold(platformName, request.PlatformName)

WDYT? I've tested that change locally and it works

@VietND96
Copy link
Contributor Author

Yes, that is reason I also updated in deployment container env

        env:
        - name: SE_NODE_BROWSER_VERSION
          value: ''
        - name: SE_NODE_PLATFORM_NAME
          value: ''

To override the Node stereotypes be aligned with scaler metadata

@JorTurFer
Copy link
Member

Does it make sense to update the logic to not check the platform/browser/version if the request doesn't specify it? I mean, from my naive pov, if the request doesn't define those values it's because any node can handle that request.

@VietND96
Copy link
Contributor Author

I changed imagePullPolicy: Always, I guess it might used the cache images without changes yet

@JorTurFer
Copy link
Member

JorTurFer commented Dec 24, 2024

/run-e2e selenium
Update: You can check the progress here

@JorTurFer
Copy link
Member

I changed imagePullPolicy: Always, I guess it might used the cache images without changes yet

I'm checking it locally right now

@JorTurFer
Copy link
Member

The problem that I see is that the nodes aren't starting, so any change applied in nodes isn't taken into account, at least until the node starts. As all the nodes are deployed scaled to 0, any configuration in the node won't be reflected in the hub

@JorTurFer
Copy link
Member

yeah, locally it still not working, 6 items and no nodes
image

@VietND96
Copy link
Contributor Author

Without set platfromName as EMPTY in Node config, it looks like it assumes as Windows by default in Grid...then the scaler scale it down to 0, due to mismatch with metadata

@VietND96
Copy link
Contributor Author

Give me some time, I will fix the test need to enforce platformName is Linux same as Node stereotype and scaler metadata

@JorTurFer
Copy link
Member

I've written you by Slack, maybe it'll be faster :)

@JorTurFer
Copy link
Member

JorTurFer commented Dec 24, 2024

/run-e2e selenium
Update: You can check the progress here

@JorTurFer JorTurFer merged commit 5db5a3a into kedacore:main Dec 24, 2024
20 checks passed
JorTurFer pushed a commit to JorTurFer/keda that referenced this pull request Dec 24, 2024
…acore#6437)

* fix: Selenium Grid scaler avoids overlapping when multiple browserVersion triggers are active

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Update CHANGELOG

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Fix e2e template test

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Change imagePullPolicy to Always to take latest change

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Update platformName default value as empty

Signed-off-by: Viet Nguyen Duc <[email protected]>

---------

Signed-off-by: Viet Nguyen Duc <[email protected]>
Signed-off-by: Jorge Turrado <[email protected]>
JorTurFer added a commit that referenced this pull request Dec 24, 2024
* fix: CVE-2024-45337 (#6422)

Signed-off-by: Jorge Turrado <[email protected]>

* :fix: ensure consistent JSON log format for automaxprocs (#6335)

* fix: ensure consistent JSON log format for automaxprocs

Signed-off-by: Omer Aplatony <[email protected]>

* moved to Unreleased

Signed-off-by: Omer Aplatony <[email protected]>

---------

Signed-off-by: Omer Aplatony <[email protected]>
Signed-off-by: Jorge Turrado <[email protected]>

* Upgrade to the latest new-relic-go v2 client (#6325)

Signed-off-by: Jorge Turrado <[email protected]>

* refactor: replace experimental `maps` and `slices` with stdlib (#6372)

Signed-off-by: Eng Zer Jun <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>
Co-authored-by: Jan Wozniak <[email protected]>
Signed-off-by: Jorge Turrado <[email protected]>

* fix: Selenium Grid scaler exposes sum of pending and ongoing sessions to KDEA (#6368)

Signed-off-by: Jorge Turrado <[email protected]>

* fix: scaledjobs stuck as not ready (#6329)

since code was missing for setting a scaledjob as ready it was stuck as unready if there ever was a problem

This is a fix for a regression in #5916

Signed-off-by: Mårten Svantesson <[email protected]>
Signed-off-by: Jorge Turrado <[email protected]>

* fix: Paused ScaledObject count is reported correctly after operator restart (#6322)

Signed-off-by: Jorge Turrado <[email protected]>

* update changelog

Signed-off-by: Jorge Turrado <[email protected]>

* Bump go and deps (#6434)

* Bump go and deps

Signed-off-by: Jorge Turrado <[email protected]>

* bump golang.org/x/net

Signed-off-by: Jorge Turrado <[email protected]>

* Add missing permissions to code-scaners

Signed-off-by: Jorge Turrado <[email protected]>

* update release tracker

Signed-off-by: Jorge Turrado <[email protected]>

* fix aws scaler

Signed-off-by: Jorge Turrado <[email protected]>

* Pass theregion

Signed-off-by: Jorge Turrado <[email protected]>

* Pass theregion

Signed-off-by: Jorge Turrado <[email protected]>

* fix aws tests

Signed-off-by: Jorge Turrado <[email protected]>

* fix aws tests

Signed-off-by: Jorge Turrado <[email protected]>

* Update AWS admission probes

Signed-off-by: Jorge Turrado <[email protected]>

---------

Signed-off-by: Jorge Turrado <[email protected]>
Signed-off-by: Jorge Turrado <[email protected]>

* commit last vendored deps

Signed-off-by: Jorge Turrado <[email protected]>

* fix: Selenium Grid in case multiple scaler triggers are activate (#6437)

* fix: Selenium Grid scaler avoids overlapping when multiple browserVersion triggers are active

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Update CHANGELOG

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Fix e2e template test

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Change imagePullPolicy to Always to take latest change

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Update platformName default value as empty

Signed-off-by: Viet Nguyen Duc <[email protected]>

---------

Signed-off-by: Viet Nguyen Duc <[email protected]>
Signed-off-by: Jorge Turrado <[email protected]>

* add missing change

Signed-off-by: Jorge Turrado <[email protected]>

---------

Signed-off-by: Jorge Turrado <[email protected]>
Signed-off-by: Omer Aplatony <[email protected]>
Signed-off-by: Eng Zer Jun <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>
Signed-off-by: Mårten Svantesson <[email protected]>
Signed-off-by: Jorge Turrado <[email protected]>
Signed-off-by: Viet Nguyen Duc <[email protected]>
Co-authored-by: Omer Aplatony <[email protected]>
Co-authored-by: smcavallo <[email protected]>
Co-authored-by: Eng Zer Jun <[email protected]>
Co-authored-by: Jan Wozniak <[email protected]>
Co-authored-by: Viet Nguyen Duc <[email protected]>
Co-authored-by: Mårten Svantesson <[email protected]>
chiragbhatia8 pushed a commit to chiragbhatia8/keda that referenced this pull request Jan 9, 2025
…acore#6437)

* fix: Selenium Grid scaler avoids overlapping when multiple browserVersion triggers are active

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Update CHANGELOG

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Fix e2e template test

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Change imagePullPolicy to Always to take latest change

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Update platformName default value as empty

Signed-off-by: Viet Nguyen Duc <[email protected]>

---------

Signed-off-by: Viet Nguyen Duc <[email protected]>
Signed-off-by: Chirag Bhatia <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants