Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Old VLAN tag remains on VF when NAD has no VLAN field. #25

Closed
alonSadan opened this issue Feb 25, 2021 · 18 comments
Closed

Old VLAN tag remains on VF when NAD has no VLAN field. #25

alonSadan opened this issue Feb 25, 2021 · 18 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@alonSadan
Copy link
Contributor

alonSadan commented Feb 25, 2021

What happened?

Once a SRIOV VF gets a VLAN tag assigned to,the vlan tag vlaue is kept,
even though the vmi/pod was connected to a network with no VLAN attribute defined on it's NetworkAttachmentDefenition.

What did you expect to happen?

When VLAN not specified - the vlan tag will be removed from the VF.
When VLAN equals 0 - the vlan tag should be with value 0.

What are the minimal steps needed to reproduce the bug?

On an SRIOV setup:

  1. set a VF with a vlan tag
  2. create a NAD with no vlan attribute for example:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: "sriov-no-vlan"
  namespace: "default"
  annotations:
    k8s.v1.cni.cncf.io/resourceName: "openshift.io/sriov_net"
spec:
  config: >-
    { "cniVersion": "0.3.1", "name": "sriov-network", "type": "sriov", "link_state":
    "enable", "ipam": {}}
  1. create a pod (could be vmi) and connect it to the VLANless network. I used:
apiVersion: v1
kind: Pod
metadata:
  name: example-pod-no-vlan-4
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-no-vlan 
spec:
  containers:
  - name: example-pod
    command: ["/bin/bash", "-c", "sleep 2000000000000"]
    image: centos/tools
  1. make sure the VF that gets assigned is the one that got a vlan, I just gave all VFs the same random VLAN tag.

Anything else we need to know?

When The vlan attribute exits on the NAD, but holds a value of 0 the VLAN does get removed from the VF.

Component Versions

Please fill in the below table with the version numbers of applicable components used.

Component Version
SR-IOV CNI Plugin 4.4
Multus
SR-IOV Network Device Plugin 4.4
Kubernetes kind-k8s-1.17
OS CentOS Linux 7

Config Files

Config file locations may be config dependent.

CNI config (Try '/etc/cni/net.d/')
Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
Kubeconfig file
SR-IOV Network Custom Resource Definition

Logs

SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)
Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)
@zshi-redhat
Copy link
Contributor

@alonSadan so you would like to see the VF vlan tag be set to 0 when not explicitly specified in net-attach-def?

@alonSadan
Copy link
Contributor Author

alonSadan commented Mar 8, 2021

As @EdDev said here, and @oshoval said here, there is a difference between vlan=0 and no vlan at all. We would like it to be reflected when applying the NAD:

  • When vlan is set to 0 on the NAD , the VF will have vlan=0.
  • When vlan is not specified in the NAD , the VF will have no vlan tag at all.

Is this possible?

@zshi-redhat
Copy link
Contributor

@adrianchiris @martinkennelly Is there any difference between setting vlan=0 and not setting any vlan on VFs?

@zshi-redhat
Copy link
Contributor

As @EdDev said here, and @oshoval said here, there is a difference between vlan=0 and no vlan at all. We would like it to be reflected when applying the NAD:

* When vlan is net to 0 on the NAD , the VF will have vlan=0.

* When vlan is not specified in the NAD , the VF will have no vlan tag at all.

I think these two are already satisfied in sriov-cni, isn't it?
However, I noticed that in SR-IOV Operator, it adds default vlan=0 in net-attach-def if not explicitly specified.

@EdDev
Copy link

EdDev commented Mar 9, 2021

* When vlan is net to 0 on the NAD , the VF will have vlan=0.

* When vlan is not specified in the NAD , the VF will have no vlan tag at all.

I think these two are already satisfied in sriov-cni, isn't it?

Per my understanding of @alonSadan findings, if the VLAN is not specified, the VF is not touching that property, therefore leaving what was there before. This is a different semantic from what was mentioned above.

However, I noticed that in SR-IOV Operator, it adds default vlan=0 in net-attach-def if not explicitly specified.

@alonSadan
Copy link
Contributor Author

As @EdDev said here, and @oshoval said here, there is a difference between vlan=0 and no vlan at all. We would like it to be reflected when applying the NAD:

* When vlan is net to 0 on the NAD , the VF will have vlan=0.

* When vlan is not specified in the NAD , the VF will have no vlan tag at all.

I think these two are already satisfied in sriov-cni, isn't it?
However, I noticed that in SR-IOV Operator, it adds default vlan=0 in net-attach-def if not explicitly specified.

From what I saw, When VLAN is not specified in the NAD, the VLAN attribute of the VF will not change, but we expect it to be removed.

@alonSadan alonSadan changed the title VLAN tags get restored on SRIOV VF when no vlan feild exists in NAD Old VLAN tag remains on VF when NAD has no VLAN field. Mar 9, 2021
@martinkennelly
Copy link
Contributor

@adrianchiris @martinkennelly Is there any difference between setting vlan=0 and not setting any vlan on VFs?

@zshi-redhat No functional difference for X710 and E810s.
@alonSadan What driver and hardware are you using?

@adrianchiris
Copy link
Contributor

adrianchiris commented Mar 9, 2021

@adrianchiris @martinkennelly Is there any difference between setting vlan=0 and not setting any vlan on VFs?

vlan 0 == no vlan for mlnx NICs

However keep in mind that for CNI cmdAdd call, if no vlan is specified then it means the existing vlan is expected to be used(whatever it may be). In this case no vlan operations are made on the VF on cmdDel call.

If vlan is specified, on cmdAdd call, that vlan will be set on the VF and the original vlan as was prior to the set will be restored on cmdDel call.

@zshi-redhat
Copy link
Contributor

As @EdDev said here, and @oshoval said here, there is a difference between vlan=0 and no vlan at all. We would like it to be reflected when applying the NAD:

* When vlan is net to 0 on the NAD , the VF will have vlan=0.

* When vlan is not specified in the NAD , the VF will have no vlan tag at all.

I think these two are already satisfied in sriov-cni, isn't it?
However, I noticed that in SR-IOV Operator, it adds default vlan=0 in net-attach-def if not explicitly specified.

From what I saw, When VLAN is not specified in the NAD, the VLAN attribute of the VF will not change,

Ok, so you are expecting sriov-cni to set vlan=0 on VF when vlan field is not specified in NAD.
When sriov-cni is used in SR-IOV Operator, NAD is created with vlan=0 by default via SR-IOV Operator API.
When sriov-cni is not used w/ SR-IOV Operator, it by default will ignore the vlan field if not configured, meaning vlan=0 is not set on VF device.

Given that Martin and Adrian had confirmed that vlan=0 is equal to no vlan, I think it doesn't hurt if we change sriov-cni to always set vlan=0 when vlan is not configured in NAD.

@adrianchiris
Copy link
Contributor

adrianchiris commented Mar 10, 2021

an "unspecified" VLAN in NAD IMO does not necessarily mean vlan 0. I expect it to mean "unspecified" as in the CNI plugin should not concern itself with VLAN configuration.

@EdDev
Copy link

EdDev commented Mar 10, 2021

an "unspecified" VLAN in NAD IMO does not necessarily mean vlan 0. I expect it to mean "unspecified" as in the CNI plugin should not concern itself with VLAN configuration.

A VLAN or lack of it defines a network, it is unreasonable to get a random one. When one does not specify a VLAN, he explicitly says that it expects it to have none.
Saying something like: Please assign me a VF, no matter what VLAN it is, makes no sense.

@adrianchiris
Copy link
Contributor

A VLAN or lack of it defines a network, it is unreasonable to get a random one.

It is the VLAN as was configured in the system it is not "random".

@EdDev
Copy link

EdDev commented Mar 10, 2021

A VLAN or lack of it defines a network, it is unreasonable to get a random one.

It is the VLAN as was configured in the system it is not "random".

That is not reasonable IMO, the CNI assigns and releases VF/s and basically controls its settings.
If the NAD is not specifying a VLAN, I conclude, as with classical networking, that there is non. Asking the operator not to trust that because the NAD is "merged" with how the VF/s have been setup by the HW admin is unreasonable.
There is a meaning of no-VLAN and such a semantic behavior is both confusing and contradicts other implementations (e.g. the SR-IOV Operator).

At KubeVirt we have been experiencing flakiness in tests due to this. The VF/s assigned had on them leaked VLAN/s (probably from other tests that used them, and somehow the release has not restored VLAN 0), and at some point such VF/s have been assigned to pods who expected no VLAN.

@alonSadan
Copy link
Contributor Author

alonSadan commented Mar 24, 2021

@EdDev @adrianchiris @zshi-redhat

From what I understand from this discussion, there are two things needs to be done in order to close this issue:

  1. update the code so that when no VLAN is provided, the VLAN will be removed (treat it like VLAN ID = 0)
  2. give a note about (1) in the docs.

Any objections?

@alonSadan
Copy link
Contributor Author

Hi @zshi-redhat @adrianchiris @martinkennelly @EdDev
I added a PR to:

  1. update the code so that when no VLAN is provided, the VLAN will be removed (treat it like VLAN ID = 0)
  2. give a note about (1) in the docs.

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 11, 2021
@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 11, 2021
@alonSadan
Copy link
Contributor Author

U/S PR got merged , and the 4.9 D/S release already include the solution for this issue.
I think this one can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants