MTL-2371 Optimize NIC loop #85

Merged
merged 35 commits into from
Feb 12, 2024
Conversation

jsollom-hpe
Contributor

The hardcoded list of NIC interfaces has two caveats that this change addresses:

  1. If a node has more than 5 interfaces, which does happen, then the interfaces beyond the fifth will never be tried.
  2. If a node has fewer than 5 interfaces, the script wastes time trying non-existent interfaces.

To address item 1, we will auto-increment an index until we hit a non-existent NIC.
While doing so, we will maintain the start index used by cms-ipxe today for compute-node optimization. The loop accounts for this by ensuring that the start index is run first, and only once.

Item 2 is implicitly addressed by item 1's fix. Because the loop only tries NICs that exist, it no longer wastes time on non-existent ones.
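The ordering described above can be modeled with a minimal Python sketch. This is not the iPXE script from this PR: the function name `nic_try_order` and its parameters are hypothetical, and the handling of a start index that points at a non-existent NIC is an assumption.

```python
def nic_try_order(existing_count, start_index):
    """Return the NIC indices tried during one chain attempt, in order.

    existing_count: number of NICs present (net0 .. net{existing_count - 1}).
    start_index: preferred NIC index, tried first and only once.
    Both names are illustrative; they do not come from the real script.
    """
    order = []
    # Try the preferred start index first (assumption: only if that NIC exists).
    if 0 <= start_index < existing_count:
        order.append(start_index)
    index = 0
    # Auto-increment from zero and stop at the first non-existent NIC.
    while index < existing_count:
        # The start index was already tried, so it is skipped here.
        if index != start_index:
            order.append(index)
        index += 1
    return order


print(["net%d" % i for i in nic_try_order(5, 2)])
```

With five NICs and a start index of 2, the sketch yields net2, net0, net1, net3, net4, which is consistent with the order seen in the first chain attempt of the test log further down.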

There is also added logic to make attempts more consistent: we close and reopen the NIC before attempting DHCP.

Lastly, this uses smaller int sizes; we don't need to allocate 32-bit integers for everything.

NOTE: Unfortunately, we can't use iflinkwait or --timeout due to the age of the iPXE source code being used. We could pull those helpful features in if MTL-2104 ever merges
(http://github.com/Cray-HPE/ipxe-tpsw-clone/pull/19).

When the maximum number of attempts is reached, an iPXE> shell will open.
Optionally, a user may interrupt (CTRL+C) the Retrying in X seconds ... message to skip to a shell.

Failed to fetch boot script!
Retrying in 1 seconds ... (CTRL+C to skip)
IPXE failed to retrieve next chain after 1024 attempts or was interrupted.
(type 'exit' to drop into BIOS)
iPXE> exit

 S2600WF
 Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz            @2.10 GHz
 IFWI Version:SE5C620.86B.OR.64.2020.51.2.04.0651.selfboot
 SE5C620.86B.02.01.0013.C0001.121520200651               261120 MB RAM
 Copyright (c) 2006-2020, Intel Corporation

 > Main
 > Advanced
 > Security
 > Server Management
 > Error Manager
 > Boot Manager
 > Boot Maintenance Manager
 > Save & Exit
 > Tls Auth Configuration





                          F10=Save Changes and Exit F9=Reset to Defaults
  ^v=Move Highlight       <Enter>=Select Entry

The output also contains hints, telling the user they may skip interfaces using CTRL+C. These hints print out once per attempt.

iPXE 1.0.0+ -- Open Source Network Boot Firmware -- http://ipxe.org
Features: DNS HTTP HTTPS iSCSI TFTP SRP AoE EFI Menu
Chaining to BSS ...
Chain attempt 1 of 1024
Hint: Press CTRL+C to skip a network interface
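
Taken together, the retry and hint behavior described above can be modeled roughly as follows. This is a Python sketch under stated assumptions, not the iPXE script itself: `chain_with_retries`, `try_nic`, and `retry_interrupted` are hypothetical names, and the one-second delay is represented only by its message.

```python
def chain_with_retries(try_nic, nic_order, max_attempts=1024,
                       retry_interrupted=lambda attempt: False, log=print):
    """Model of the outer chain loop; returns "booted" or "shell".

    try_nic(index): hypothetical callback, True if the boot script was
        fetched over that NIC (False also models a skipped interface).
    retry_interrupted(attempt): hypothetical callback, True if the user
        pressed CTRL+C during the retry message.
    """
    for attempt in range(1, max_attempts + 1):
        log(f"Chain attempt {attempt} of {max_attempts}")
        # The hint is printed once per attempt.
        log("Hint: Press CTRL+C to skip a network interface")
        for index in nic_order:
            if try_nic(index):
                return "booted"
        log("Failed to fetch boot script!")
        log("Retrying in 1 seconds ... (CTRL+C to skip)")
        if retry_interrupted(attempt):
            break
    log(f"IPXE failed to retrieve next chain after {max_attempts} "
        "attempts or was interrupted.")
    return "shell"


calls = []

def fake_try_nic(index):
    """Pretend only net1 works, and only from the second pass onward."""
    calls.append(index)
    return index == 1 and calls.count(1) >= 2

lines = []
result = chain_with_retries(fake_try_nic, [2, 0, 1, 3, 4],
                            max_attempts=3, log=lines.append)
print(result, calls)
```

In this example the first pass fails on every NIC and the second pass succeeds on net1, mirroring the shape of the test log below.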

Testing

List the environments in which these changes were tested.

Tested on:

  • mug

Test description:

  • Tested configMap generation
  • Tested NCN boots (CN boots not necessary, since the order of NICs tried remained the same)

For the test, I interrupted a few of the attempts in order to emulate a loop failure. In this way, we can observe that only the existing NICs were tried and what happens on success.

>>Start PXE over IPv4.
  Station IP address is 10.1.1.8

  Server IP address is 10.92.100.60
  NBP filename is ipxe.efi
  NBP filesize is 1045280 Bytes
 Downloading NBP file...

  NBP file downloaded successfully.
iPXE initialising devices...ok



iPXE 1.0.0+ -- Open Source Network Boot Firmware -- http://ipxe.org
Features: DNS HTTP HTTPS iSCSI TFTP SRP AoE EFI Menu
Chaining to BSS ...
Chain attempt 1 of 1024
Hint: Press CTRL+C to skip a network interface
Waiting for link-up on net2........ ok
Configuring [dhcp] (net2 a4:bf:01:51:15:0a).................. Connection timed out (http://ipxe.org/4c106092)
Waiting for link-up on net0............ Operation canceled (http://ipxe.org/0b072095)
Configuring [dhcp] (net1 a4:bf:01:51:15:09)... Operation canceled (http://ipxe.org/0b072095)
Waiting for link-up on net3... Operation canceled (http://ipxe.org/0b072095)
Waiting for link-up on net4... Operation canceled (http://ipxe.org/0b072095)
Failed to fetch boot script!
Retrying in 1 seconds ... (CTRL+C to skip)
Chain attempt 2 of 1024
Hint: Press CTRL+C to skip a network interface
Configuring [dhcp] (net2 a4:bf:01:51:15:0a).................. Connection timed out (http://ipxe.org/4c106092)
Waiting for link-up on net0................. Down (http://ipxe.org/38086193)
Waiting for link-up on net1........ ok
Configuring [dhcp] (net1 a4:bf:01:51:15:09).......... ok
net1 IPv4 lease: 10.1.1.8 MAC: a4:bf:01:51:15:09
EFITIME is 2024-01-24 20:18:32
https://api-gw-service-nmn.local/apis/bss/boot/v1/bootscript...X509 chain 0x5bdb7f68 added X509 0x5bdc3e88 "mug.hpc.amslabs.hpecorp.net"
X509 chain 0x5bdb7f68 added X509 0x5be2e160 "Platform CA - L1 (0d39a12c-3e9e-45e8-81ce-d0f576c59742)"
EFITIME is 2024-01-24 20:18:32
X509 chain 0x5bdb7f68 added X509 0x5be2e2a0 "Platform CA (0d39a12c-3e9e-45e8-81ce-d0f576c59742)"
X509 0x5be2e160 "Platform CA - L1 (0d39a12c-3e9e-45e8-81ce-d0f576c59742)" is a root certificate
X509 0x5bdc3e88 "mug.hpc.amslabs.hpecorp.net" successfully validated using issuer 0x5be2e160 "Platform CA - L1 (0d39a12c-3e9e-45e8-81ce-d0f576c59742)"
 ok
http://rgw-vip.nmn/boot-images/545ba744-d5e4-4725-9960-3bbe2d9be8d0/kernel... ok
http://rgw-vip.nmn/boot-images/545ba744-d5e4-4725-9960-3bbe2d9be8d0/initrd... ok

jsl-hpe and others added 30 commits July 12, 2023 10:10
…1689174565

[chore] master -> develop from PR #70 (release/1.11.5)
CASMCMS-8716: Update some dependency patch versions
Update Jenkinsfile to reduce chances of hung or failed builds
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 37 to 38.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](tj-actions/changed-files@v37...v38)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
…ions/changed-files-38

Bump tj-actions/changed-files from 37 to 38
Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
…s/checkout-4

Bump actions/checkout from 3 to 4
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 38 to 39.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](tj-actions/changed-files@v38...v39)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
…ions/changed-files-39

Bump tj-actions/changed-files from 38 to 39
Bumps [stefanzweifel/git-auto-commit-action](https://github.com/stefanzweifel/git-auto-commit-action) from 4 to 5.
- [Release notes](https://github.com/stefanzweifel/git-auto-commit-action/releases)
- [Changelog](https://github.com/stefanzweifel/git-auto-commit-action/blob/master/CHANGELOG.md)
- [Commits](stefanzweifel/git-auto-commit-action@v4...v5)

---
updated-dependencies:
- dependency-name: stefanzweifel/git-auto-commit-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
…zweifel/git-auto-commit-action-5

Bump stefanzweifel/git-auto-commit-action from 4 to 5
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 39 to 40.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](tj-actions/changed-files@v39...v40)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
…ions/changed-files-40

Bump tj-actions/changed-files from 39 to 40
…698433067

[chore] master -> develop from PR #78 (hotfix/1.11.6)
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 40 to 41.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](tj-actions/changed-files@v40...v41)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
…ions/changed-files-41

Bump tj-actions/changed-files from 40 to 41
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 41 to 42.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](tj-actions/changed-files@v41...v42)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
…ions/changed-files-42

Bump tj-actions/changed-files from 41 to 42
@jsollom-hpe jsollom-hpe requested a review from a team as a code owner January 27, 2024 00:20

sonarqubecloud bot commented Feb 8, 2024

Quality Gate passed

Issues
0 New issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@jsollom-hpe jsollom-hpe merged commit 5c49f7e into master Feb 12, 2024
9 of 10 checks passed
mharding-hpe added a commit that referenced this pull request Feb 12, 2024
4 participants