Issue with built-in load balancer #1216
Thanks for the details @nemonik. Are you launching k3s with the |
Thank you @erikwilson. Yes, I do launch via the
where https://github.com/nemonik/hands-on-DevOps/blob/master/ansible_extra_vars.rb#L26. On the agent,
the networking between the vagrants is defined as follows in the Vagrantfile: https://github.com/nemonik/hands-on-DevOps/blob/master/Vagrantfile#L159
and https://github.com/nemonik/hands-on-DevOps/blob/master/Vagrantfile#L183
I did try to resolve the problem by separating an app's LoadBalancer service resource out into its own resource file, so that after the agent is added I can delete, for example, GitLab's Service of type LoadBalancer and then re-apply it in the hope that the LoadBalancer performs as it did with v0.9.x, but doing so doesn't solve the problem.
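A hedged sketch of that delete-and-re-apply step (file name, namespace, selector, and ports here are assumptions, not the course's actual manifest):

```sh
# Hypothetical standalone manifest for GitLab's Service of type LoadBalancer,
# kept separate so it can be deleted and re-applied after the agent joins.
cat > gitlab-loadbalancer.yml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: gitlab
  namespace: gitlab
spec:
  type: LoadBalancer
  selector:
    app: gitlab
  ports:
    - name: http
      port: 10080
      targetPort: 80
EOF

kubectl delete -f gitlab-loadbalancer.yml   # drop the old Service...
kubectl apply -f gitlab-loadbalancer.yml    # ...and re-create it once the agent is part of the cluster
```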
Failing this being a regression in K3s between v0.9.x and v1.0.x (the same problem exists in the next release candidate of K3s), I was going to try spinning up the cluster first and then deploying the apps, versus spinning up the server, deploying the apps, and then adding an agent. If this works it will lead to me refactoring the course automation.
Thanks for the details and help with debugging. I would like to see a very minimal example of the issue, maybe using the built-in CoreDNS service. It is good to know that you are using the |
I am using the built-in CoreDNS. I'll try weaning the project down to a minimal project. Is using Ansible to configure okay? It would allow me to re-use my roles, pared down quite a bit of course. I will also try using the built-in containerd after trying docker... I selected docker so that students were not confronted with two container runtimes and had a full view into what containers were running, with tools they had a greater chance of being remotely familiar with. The project's choice of supporting both runtimes is swell. I am using CentOS 7. The Vagrantfile not finding a Vagrant base box of https://github.com/nemonik/hands-on-DevOps/tree/master/box after retrieving
Let me try v0.10.0... I jumped from v0.9.1 to v1.0.0. That would be an easy try... Lemme try that first and get back to you.
I'll start with v0.10.2 and walk my way back to v0.9.1 until the problem goes away.
Thanks for the info! Was trying to parse through the repo but figured it easier to just ask some questions. The usual suspects for me might be the flags, network config, iptables (as you prob know should be legacy), or even kernel (might be worth trying Ubuntu). Sorry I don't have a better answer, thanks very much for helping to debug this, would very much like to get to the root cause and find a fix.
I weaned things down to a simple project, but the problem wasn't there in the first run. I did change a few things in the process, though, so I'm including those changes back into the full project to see if what I changed actually addressed the problem or if this was a fluke.
So, my success appears to have either been a brain fart or a one-time fluke. The load balancer is really slow... on the agent. Here is a minimal example as you requested: https://github.com/nemonik/k3s-issue-1216. Run
You will need
All pushed... Please let me know if you need help. Why CentOS? Why not Alpine or Ubuntu? Well, CentOS tracks RHEL, and for RHEL you have STIGs ("Security Technical Implementation Guides"). For enterprise and government, the pedigree of RHEL, and therefore of CentOS as it tracks RHEL, is far higher and better received than that of others which have no STIGs or only emergent STIGs.
I also included a role to configure GitLab... But the same problem can be demonstrated via a simple hello-world web app container. Again, the problem is from the agent: if you send a request to the app through the server IP, the response takes far more time than it should, whereas requests sent from the server or from the host underlying the VMs respond quickly. Also, if you send the request to the agent while on the agent, the response is immediate.
If you switch to the deploy_app_from_agent branch you can see the exact same thing happens when you deploy the httpd app from the agent. The httpd app will not respond in a timely fashion (it will eventually, but it takes some time) to requests sent from the agent, though it does to requests sent from the host underlying the VMs and from the server, so the httpd ansible role's task
will ultimately fail when executed on the agent.
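A hedged shell equivalent of that kind of readiness check (port 8080 is taken from later in this thread; the actual Ansible task may differ):

```sh
# Give the app a few seconds to answer through the service load balancer.
# From the agent this times out; from the server or the host it returns immediately.
time curl --silent --show-error --max-time 10 -o /dev/null http://192.168.0.11:8080/
```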
Thanks for the info. There is still too much going on here; I shouldn't need to check out a repo to understand the problem. What are the server args and what are the agent args? I wasn't asking for a justification for CentOS, just for more data points. Did you try it with containerd?
Oh, gosh. Knowing that would have saved me some time. Agent Args:
Server Args:
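Purely as an illustration, not the course's exact invocation: a server/agent pair matching the setup described in this thread (Docker runtime, static IPs 192.168.0.11 and 192.168.0.10, flannel pinned to a host-only interface assumed here to be eth1) might be launched roughly like this:

```sh
# Server (192.168.0.11), using the Docker runtime and the host-only interface:
k3s server \
  --docker \
  --flannel-iface eth1 \
  --node-ip 192.168.0.11

# Agent (192.168.0.10), joining that server:
k3s agent \
  --docker \
  --flannel-iface eth1 \
  --node-ip 192.168.0.10 \
  --server https://192.168.0.11:6443 \
  --token "${K3S_TOKEN}"
```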
I can certainly run the two roles without the
Understood. All I was saying is Ubuntu and Alpine are not possible for the reasons I gave.
Thank you much for the example and all of the details. Apologies, there is just limited bandwidth for me or a QA person to understand what is happening with the repo, and as you can imagine we may not want to blindly run a script unless it is in a throw-away environment. I am not asking you to change your OS for the lab, I just want to know if the issue exists for you in other environments. Have you looked into iptables at all? Is it possible that a firewall is causing the problem? If a firewall is running it would be worth temporarily disabling it on all nodes to see if it alleviates the issue.
CentOS ships with a firewall called
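Assuming that firewall is firewalld, the CentOS 7 default, a minimal sketch of temporarily turning it off on every node to rule it out:

```sh
# Temporarily stop the CentOS 7 default firewall on each node.
# This does not survive a reboot unless the unit is also disabled.
sudo systemctl stop firewalld
sudo systemctl status firewalld --no-pager   # confirm it is inactive

# To bring it back afterwards:
sudo systemctl start firewalld
```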
I'll give it a go.
Running with from localhost:
from server:
from agent:
So, ignoring the fact that I can easily ssh from the agent into the server with no delay: if I spin up the same httpd container on port 8080 via docker that I'm spinning up via kubectl behind the K3s load balancer, I find that I can access this container from the host, the server, and the agent with no delay.
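A hedged sketch of that control test, outside of k3s (the image tag is an assumption):

```sh
# Run the same httpd image straight under Docker on the server, publishing it on 8080.
docker run -d --name httpd-control -p 8080:80 httpd:2.4

# From the host, the server, and the agent, this responds without delay:
time wget -qO- http://192.168.0.11:8080/ > /dev/null
```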
I disabled the K3s LoadBalancer and deployed the latest MetalLB with the following configuration
Redeployed the httpd container using the MetalLB
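As a hedged illustration of the kind of configuration involved (ConfigMap-style MetalLB of that era in layer-2 mode; the address pool below is an assumption, not the range used in this cluster):

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.0.200-192.168.0.250
EOF
```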
I've verified this seems to be an issue limited to CentOS 7... The issue doesn't appear on vagrants running Alpine.
I do have the same issue
I'm not alone. It is nice to be not alone. :)
Also in the same boat @nemonik
Same behaviour here in Debian 10 running on VirtualBox with two interfaces, where the one that should have the services exposed is eth1.
Apparently it looks like something related to Flannel, because my deployment of Prometheus cannot scrape data from pods with services exposed on cluster IPs. Also my ingress routes don't work externally. I've started K3s with --flannel-iface=eth1 as it's my external interface.
I have 3 master nodes with CentOS 7. If I try to access a pod running on a different host using the "ServiceIP" it takes time, but if I use the "PodIP" there is no delay. Because of this, the other 2 master nodes give timeout errors while connecting to metrics-server. Is there any workaround for this on CentOS 7?
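A hedged sketch of that comparison (service and pod names are placeholders, not from this cluster):

```sh
# Look up the service's ClusterIP and a backing pod's IP...
kubectl get svc httpd -o wide
kubectl get pods -l app=httpd -o wide

# ...then time a request to each from another node.
time curl -s -o /dev/null http://<service-cluster-ip>:8080/   # slow across nodes on CentOS 7
time curl -s -o /dev/null http://<pod-ip>:80/                 # fast
```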
I'm not even using flannel and experience the exact same results - with Cilium. I can easily replace the built-in service load balancer with metal-lb and everything works flawlessly. I'm using VirtualBox; my base boxes are Ubuntu 19.10 with kernel 5.3.
I did the same thing earlier, replacing it with MetalLB to get around the problem.
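A hedged sketch of how the bundled service load balancer gets turned off so MetalLB can own Services of type LoadBalancer (on the k3s versions in this thread the flag was --no-deploy; newer releases accept --disable):

```sh
# Skip k3s's bundled service load balancer (servicelb / klipper-lb):
k3s server --no-deploy servicelb
# or, on newer releases:
k3s server --disable servicelb
```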
I think running
on each worker node will temporarily resolve this issue, but only temporarily, as it won't last through a reboot of a node. See:
So, I got around the issue by setting
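The workaround cited in the Kubernetes and flannel issues linked below is to turn off transmit checksum offload on flannel's VXLAN interface, roughly:

```sh
# Disable TX checksum offload on flannel's VXLAN interface on each node.
# This does not persist across reboots.
sudo ethtool -K flannel.1 tx-checksum-ip-generic off

# Verify the setting took effect:
ethtool -k flannel.1 | grep tx-checksum-ip-generic
```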
The issue can be traced back to kubernetes/kubernetes#88986
A patch was made to flannel to address it: flannel-io/flannel#1282
@nemonik many thanks for keeping this issue updated and digging in to find a solution. It sounds like this will be fixed in the CentOS kernel, but we should disable the checksum anyways? Am guessing we will want to add ethtool to k3s-root and call it during flannel setup, but it might be sufficient to cherry-pick or use a variant of that patch if it is accepted.
You are welcome. I would disable... CentOS kernel updates are very infrequent.
Hopefully the flannel patch will be merged.
The fix for the underlying upstream issue landed in K8s v1.19 and was cherry-picked into 1.18.6. Should be good to re-test / close. kubernetes/kubernetes#92256
Correct - this should be fixed on both release and master branches.
Version:
Versions since v1.0.0.
Describe the bug
With v0.9.x, I was able to spin up a server (whose IP is 192.168.0.11) on a vagrant (VM), spin up an app behind the built-in K3s LoadBalancer on the server node, add an agent (whose IP is 192.168.0.10) on another vagrant to form a cluster, and then access the app from the server node, the agent node, and the host hosting the VirtualBox hypervisor on which the server and agent run. Since v1.0.0, as soon as the agent is added, the response time for requests sent from the agent to an app orchestrated on the server spikes to well over a minute, resulting in clients timing out. If the app is exposed via a NodePort there is no problem, but the higher port numbers present a problem for novice users.
My use case needs the apps to be accessible once ssh'ed into the agent, as I teach a hands-on DevOps class using K3s to host all the applications in a two-node cluster, with the apps initially being spun up on the server and the agent then being added. The class is taught in a resource-constrained environment (laptops and desktop PCs), so students also use the agent VM for development. This approach worked perfectly pre-v1.0.0 and broke with the release of v1.0.0.
For example, when the students use the git command-line client on the development vagrant (also an agent), requests sent to GitLab hosted in the cluster will fail, because GitLab orchestrated by K3s takes too long to respond. This only happens on the agent. GitLab quickly responds to requests from the host that is hosting VirtualBox, as it does to requests from the toolchain vagrant. Although, if you send the request to the development vagrant's IP and port from the agent vagrant, the application will respond quickly. For example, the time for GitLab to respond from the development vagrant (the agent node) is 1 minute and 3.266 seconds as per
Whereas from the toolchain vagrant (also the server node) the response time is 0.103 seconds as per
If you send the request to the development vagrant ip and port of the application, the response time matches the other favorable response times
If you pop onto the development vagrant as it is being provisioned and loop over a
wget http://192.168.0.11:10080
to access GitLab orchestrated on the server node, you will see the same quick response time up until the agent is added to the cluster, and then it takes an inordinate amount of time for GitLab to respond. If you expose GitLab via NodePort, responses come quickly no matter where you are.
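A sketch of that loop, assuming the GitLab port 10080 used earlier in this issue:

```sh
# Watch GitLab's response time from the development vagrant while the agent is
# being added; the per-request time jumps from well under a second to over a
# minute as soon as the node joins the cluster.
while true; do
  time wget -qO- http://192.168.0.11:10080 > /dev/null
  sleep 5
done
```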
To Reproduce
1. Spin up a K3s server.
2. Spin up GitLab like so on the server
3. time wget http://<ip of server>:10080 (in my example the IP of the server is 192.168.0.11) and note how quickly GitLab responds.
4. Spin up the vagrant that will become the agent, without joining it to the cluster yet.
5. time wget http://<ip of server>:10080 on the agent and note how quickly GitLab responds.
6. Add the agent to the cluster.
7. time wget http://<ip of server>:10080 on the agent and note how quickly GitLab responds.

The time it takes to receive a response in steps 3 and 5, in comparison to step 7, should be very different. 3 and 5 are wicked fast. 7 is wicked slow.
Expected behavior
Apps hosted on the cluster behind the K3s built-in LoadBalancer should respond in the same fashion no matter where you access them from.
Actual behavior
If you spin up a server, spin up an app behind the built-in K3s LoadBalancer on the server node, add an agent to form a cluster, and then attempt to access the app from the agent, the app takes far too long to respond in comparison to response times from the host hosting the two VMs (server and agent) and from the server.
Additional context
My course material is here https://github.com/nemonik/hands-on-DevOps