DP not up (sometimes) when reloading config through SIGHUP #4568

Open
TVKain opened this issue Sep 12, 2024 · 12 comments

Comments

@TVKain

TVKain commented Sep 12, 2024

image

  • When reloading the config through SIGHUP, faucet sometimes logs "DP not up" and new flows are not sent down to the switch
  • This behavior is inconsistent: it does not happen on every reload

Here is a capture of the traffic between the switch and the controller when "DP not up" is logged:
image

Faucet version: 1.10.11

@gizmoguy
Member

I would need a bit more information to debug this. I notice your capture was started after the log message, so any change to the TCP state of the control channel will be missing.

But does the switch eventually recover and have the correct flows programmed? "DP not up" isn't necessarily a problem; faucet is just saying the switch reset its control channel state.

@TVKain
Author

TVKain commented Sep 19, 2024

Steps to reproduce

  1. A process sends SIGHUP to the faucet controller every 5 seconds
  2. The faucet controller is running and listening on port 6653
  3. One Open vSwitch switch is connected to the faucet controller:
     ovs-vsctl set-controller br-f1 tcp:127.0.0.1:6653
  4. The faucet config file contains 5 VLANs, each with 3 ACL rules
  5. New flows are sent down to OVS
  6. The faucet config file is then populated with 3000 VLANs, each with 3 ACL rules
  7. The faucet log shows DP down and new flows aren't sent down to the OVS switch

dp_down
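
For anyone trying to reproduce this, a config of that size can be generated rather than written by hand. The following is only a sketch: the actual config file is not shown in this thread, so the VLAN names, the VID range starting at 100, and the `ipv4_src` matches are all hypothetical, and the `dps` section is left out entirely. It assumes the usual Faucet layout of a `vlans` map with `vid` and `acls_in`, and an `acls` map of `rule` entries.

```python
def build_config(num_vlans, rules_per_acl=3, first_vid=100):
    """Return a Faucet-style config dict with one ACL per VLAN.

    first_vid and every name below are hypothetical, chosen for illustration.
    """
    if first_vid < 1 or first_vid + num_vlans - 1 > 4094:
        raise ValueError("VIDs must stay within 1..4094")

    vlans = {}
    acls = {}
    for i in range(num_vlans):
        vid = first_vid + i
        acl_name = f"acl-{vid}"
        # Each rule allows IPv4 traffic from one made-up source address.
        acls[acl_name] = [
            {
                "rule": {
                    "dl_type": 0x800,
                    "ipv4_src": f"10.{i // 256}.{i % 256}.{r + 1}/32",
                    "actions": {"allow": 1},
                }
            }
            for r in range(rules_per_acl)
        ]
        vlans[f"vlan-{vid}"] = {"vid": vid, "acls_in": [acl_name]}
    # The dps section is omitted here; it has to come from the real deployment.
    return {"vlans": vlans, "acls": acls}


small = build_config(5)
large = build_config(3000)
print(len(small["vlans"]), len(large["vlans"]))
```

The resulting dict still has to be merged with a `dps` section and serialised to YAML before faucet can load it.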

PCAP files

The pcap files contain the packets captured from the moment the config file has 5 VLANs and the flows for those are sent down to the switch (everything is fine up to this point) until after the faucet log shows DP down.
faucet.zip

Versions

  • Faucet 1.10.11
  • Open vSwitch 3.3.0

@gizmoguy
Member

gizmoguy commented Sep 25, 2024

Thanks for the additional information.

This will be caused by the default OpenFlow hello timers in Open vSwitch being too low for the number of flow rules you want to push, so Open vSwitch times out the connection.

You need to tune the following ovsdb options:

  • inactivity_probe
  • controller_rate_limit
  • controller_burst_limit

There is some documentation here on how to do that:

https://bugs.launchpad.net/neutron/+bug/1817022

Also note there was a bug in certain versions of OVS (introduced in v2.12.0 and fixed in v3.3.0) where these configuration values weren't always honored, so make sure you aren't running an affected version, see details on this mailing list thread:

https://mail.openvswitch.org/pipermail/ovs-dev/2023-September/408205.html
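
As a rough sketch of what tuning those three options could look like, the snippet below builds an `ovs-vsctl set controller` command for the bridge from the reproduction steps. The three numeric values are placeholders picked for illustration, not recommendations; `inactivity_probe` is expressed in milliseconds. The helper names are hypothetical.

```python
import subprocess


def controller_tuning_cmd(bridge, inactivity_probe_ms, rate_limit, burst_limit):
    """Build the ovs-vsctl command that sets the three Controller columns."""
    return [
        "ovs-vsctl", "set", "controller", bridge,
        f"inactivity_probe={inactivity_probe_ms}",
        f"controller_rate_limit={rate_limit}",
        f"controller_burst_limit={burst_limit}",
    ]


def apply_tuning(cmd):
    """Run the command; only call this on a host where ovs-vsctl is installed."""
    subprocess.run(cmd, check=True)


# Placeholder values, assumed for illustration only.
cmd = controller_tuning_cmd("br-f1", 30000, 1000, 250)
print(" ".join(cmd))
```

Running `ovs-vsctl list controller` afterwards should show whether the values were stored.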

@TVKain
Author

TVKain commented Sep 26, 2024

Thank you for the reply, I will try it ASAP.

I do have an additional question, though. I have not dug too far into the source code yet, but I notice that whenever there are changes to VLANs or ports, faucet "cold" starts, while in other situations, such as changes to ACLs, faucet "warm" starts. Could you tell me why that is?

Also, could you clarify the behavior of "cold" starting vs "warm" starting?

Sidenote:

  • I have tried setting inactivity_probe to 3000000 and the error still persists
  • I tried commenting out the part which I believe causes faucet to "cold" start; the error seems to have disappeared and flows are sent down to the switch
  • It happens even with few VLANs

File: valve.py
Function: _apply_config_changes(self, new_dp, changes, valves=None)

        # # If pipeline or all ports changed, default to cold start.
        # if self._pipeline_change(new_dp):
        #     self.dp_init(new_dp, valves)
        #     return restart_type, ofmsgs
        #
        # if all_ports_changed:
        #     self.logger.info("all ports changed")
        #     self.dp_init(new_dp, valves)
        #     return restart_type, ofmsgs

@TVKain
Author

TVKain commented Sep 30, 2024

Another follow-up to this.
This is the osken-manager log when the incident happened:
image

This is the osken-manager log when "cold" reload works normally
image

This is the Open vSwitch logs in both cases
image

From the logs, it looks like the error happens when the following event is missing:

connected socket:<eventlet.greenio.base.GreenSocket....

Could this be the reason?

@gizmoguy
Member

gizmoguy commented Oct 9, 2024

Cold start = faucet deletes all the flows in the openvswitch flow table and readds them

Warm start = faucet just applies a minimal diff to the flow table in order to implement the new behaviour represented by the config change (e.g add a new ACL)
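
To make the difference concrete, here is a toy illustration that models a flow table as a set of strings. It is not how faucet computes its OpenFlow messages; the flow names and the `delete_all`/`add`/`delete` tuples are made up for the example.

```python
def cold_start(new_flows):
    """Delete everything, then re-add the full new flow set."""
    return [("delete_all", None)] + [("add", flow) for flow in sorted(new_flows)]


def warm_start(old_flows, new_flows):
    """Emit only the deletes and adds needed to turn old_flows into new_flows."""
    deletes = [("delete", flow) for flow in sorted(old_flows - new_flows)]
    adds = [("add", flow) for flow in sorted(new_flows - old_flows)]
    return deletes + adds


old = {"vlan100:acl1", "vlan100:acl2"}
new = {"vlan100:acl1", "vlan100:acl2", "vlan100:acl3"}

print(len(cold_start(new)))   # 4 messages: one delete_all plus three adds
print(warm_start(old, new))   # [('add', 'vlan100:acl3')]
```

With thousands of flows the cold path resends everything, while the warm path grows only with the size of the change.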

From looking at some of your earlier logs, it appears you are reloading faucet while it is in the middle of a cold start. Have you tried only reloading it after the cold start finishes? Does that help?

@TVKain
Author

TVKain commented Oct 10, 2024

Regarding the Open vSwitch version, I'm using Open vSwitch 3.3.0

I also suspect that reloading faucet while it's cold starting might be the problem. However, I need to reload the config programmatically and I don't know of a way to hook into faucet to check whether a cold start is in progress, so I just trigger the reload every 10 seconds.

For more context, I'm trying to use Faucet to control a single Open vSwitch switch, br-f (my name for it), that will act as a multi-tenant firewall integrated with the OpenStack br-ex bridge.

Also, could you clarify why changes to VLANs need a cold start? Why are all flows deleted and then re-added?

I would also like to know whether there is any risk in changing Faucet to only "warm" start, even in cases where it is meant to "cold" start, such as changes to VLANs.

From what I've tested so far, nothing dangerous has happened.

@gizmoguy
Member

You can ask faucet which config file it currently has loaded via the prometheus interface, and only send it one HUP each time your configuration file changes, rather than reloading faucet every 10 seconds:

$ curl localhost:9302/metrics | grep faucet_config_hash_info
# HELP faucet_config_hash_info file hashes for last successful config
# TYPE faucet_config_hash_info gauge
faucet_config_hash_info{config_files="/etc/faucet/faucet.yaml",error="",hashes="ce1dfaa2df25e0643001fba754799238a576f328893c871c7592714cbec9fef6"} 1.0
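
A possible way to script this is sketched below: read the `hashes` label from the metric, send a single SIGHUP, then poll until the reported hash changes. The URL reuses the port from the curl example; `faucet_pid`, the timeout, and the poll interval are assumptions. Note that a changed hash only indicates that faucet loaded the new file, which is not necessarily the same as every flow having reached the switch, and the hash will not change if the file content is unchanged.

```python
import os
import re
import signal
import time
import urllib.request

METRICS_URL = "http://localhost:9302/metrics"  # port taken from the curl example


def extract_config_hash(metrics_text):
    """Return the hashes label of faucet_config_hash_info, or None if absent."""
    for line in metrics_text.splitlines():
        # Skip the "# HELP" and "# TYPE" comment lines; match only the sample.
        if line.startswith("faucet_config_hash_info{"):
            match = re.search(r'hashes="([^"]*)"', line)
            if match:
                return match.group(1)
    return None


def fetch_config_hash(url=METRICS_URL):
    """Fetch the metrics page and return the currently reported config hash."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return extract_config_hash(resp.read().decode("utf-8"))


def reload_and_wait(faucet_pid, timeout_s=60.0, poll_s=1.0):
    """Send one SIGHUP, then poll until the reported hash changes or we time out."""
    before = fetch_config_hash()
    os.kill(faucet_pid, signal.SIGHUP)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch_config_hash() != before:
            return True
        time.sleep(poll_s)
    return False
```

Calling `reload_and_wait(pid)` once per config change, and waiting for it to return before writing the next change, avoids sending a second HUP while the first reload is still being processed.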

For warm reloads, we need to implement specific warm reload behaviour for every possible change you can make in the config file, because we have to compute the difference in the OpenFlow rules and apply that difference as a series of flow adds/deletes/modifies. For changes we haven't implemented warm reload for, we fall back to a cold restart, which will always work.

We would of course be open to contributions that implement warm restart for VLAN changes. You can find the code and the relevant TODO here: https://github.com/faucetsdn/faucet/blob/main/faucet/valve.py#L1672-L1677

@TVKain
Author

TVKain commented Oct 31, 2024

Thank you for your reply, I'll look into it.

@gizmoguy
Member

gizmoguy commented Nov 6, 2024

Actually, here is another idea that might be a bit easier: faucet has an environment variable, FAUCET_CONFIG_STAT_RELOAD. If you set it to true, faucet will monitor your configuration file and automatically reload whenever it changes:

https://docs.faucet.nz/en/latest/configuration.html#environment-variables

@TVKain
Author

TVKain commented Nov 9, 2024

That seems nice, but I do have one question: what does Faucet do if, for example, the configuration file is written to while it's automatically reloading?

@gizmoguy
Member

I'm not sure we have a test for that specific case, so I can't say what would happen.
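
One way to sidestep the question on the writer's side, independent of how faucet behaves, is to never modify the config file in place: write the new content to a temporary file in the same directory and rename it over the real path, so a reader sees either the old file or the new one. This is a generic sketch, not something faucet provides; the function name and the temporary file prefix are made up.

```python
import os
import tempfile


def write_config_atomically(path, text):
    """Replace the file at path with text without exposing a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".faucet-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Files created by `tempfile.mkstemp` are readable only by the creating user, so the mode may need adjusting with `os.chmod` before the rename if faucet runs under a different account.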
