KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state #21087
Comments
After downgrading to 2.1.0, logs showed another error (but reported with the log level INFO!)
Giving the missing permission to the scheduler's service account fixed the issue. Still need to upgrade the |
I think this also has a similar root cause to #12644. @dimberman @jedcunningham @kaxil - or maybe someone else who has more experience with "real life" K8S deployments - this "Resource too old" error is returned by K8S when there are too many changes to a version of a K8S resource. I am just wondering - does it happen because we deploy changes "incrementally" too frequently (and too many times) in the chart/deployment? Or maybe because we do NOT deploy the "full" deployment where we should? I am not very experienced with long-running K8S deployments, but to me it looks like something that could be solved by identifying which resources those are and implementing a full "re-deployment" from time to time. It might be that this is outside of our control as well, but I've seen some other people complaining about this recently, so maybe someone who has more insight could take a look? |
It's not that there are too many changes, at least not in the way you are thinking. The executor is trying to watch from (history) revision n, which has rolled off of history on the k8s side; n+2 might be the oldest available now. This really isn't related to our deployments, and there isn't anything we can do in that regard to help here. In fact, you could hit this even when running the scheduler outside k8s. More context here: #15500 (comment). Bottom line, especially now that we aren't pinned to |
Thanks for the context - now I understand where it comes from! Yeah it isn't an easy one to handle! |
I have the same issue. It looks like it is happening because the kubernetes library now (I don't know from which version) handles the 410 error on its side: there is one retry and then an exception is raised if the event is of type 'ERROR'. I checked the kubernetes library and it was changed in this pull request: On the Airflow Kubernetes Executor side it is being handled here:
by the process_error function, but it should probably now be enclosed in a try/except ApiException with a check for the 410 status code. |
To me, per the tickets referenced in #15500, the problem seems to be more in the Kubernetes Python client. It does not handle the kind of messages we'd need for Airflow to recover when it has lost track of resource versions. Last I checked there was no bug report there regarding that, so I did open one. I am sadly not the most knowledgeable to provide insight on how best to resolve this. @jedcunningham Maybe you can help? Until this is resolved on the k8s client side, what is the plan for Airflow? Wait? Resurrect the #15500 PR? (not sure it fully solves the issue) |
The current deployable check does not allow redeploying the same deployment once again, meaning that if your deployment gets corrupted somehow you cannot delete it and deploy it again. As an example, Airflow (jobflow) currently has a bug, due to missing support for certain kubernetes resources, which can make it lose track of some of the resources it created in k8s. The only workaround seems to be to delete and re-create the deployment (_cf._ apache/airflow#21087).
I concur with @arkadiusz-bach's analysis. The error is simply happening because newer versions of the kubernetes lib throw exceptions which airflow fails to handle.
The culprit is here. So a simple fix would be to do something like this:
def run(self) -> None:
    """Performs watching"""
    kube_client: client.CoreV1Api = get_kube_client()
    if not self.scheduler_job_id:
        raise AirflowException(NOT_STARTED_MESSAGE)
    while True:
        try:
            self.resource_version = self._run(
                kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
            )
        except ReadTimeoutError:
            self.log.warning(
                "There was a timeout error accessing the Kube API. Retrying request.", exc_info=True
            )
            time.sleep(1)
+       except ApiException as e:
+           if e.status == 410:
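+               # 410 Gone / "resource version too old": drop the stale
+               # checkpoint so the next watch starts from scratch.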
+               self.resource_version = "0"
+           else:
+               raise
        except Exception:
            self.log.exception('Unknown error in KubernetesJobWatcher. Failing')
            raise
        else:
            self.log.warning(
                'Watch died gracefully, starting back up with: last resource_version: %s',
                self.resource_version,
            )
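One note: this assumes `ApiException` is available in `kubernetes_executor.py`; if it is not already imported there, it would come from the kubernetes client:

```python
# Standard exception class raised by the kubernetes python client on API errors.
from kubernetes.client.rest import ApiException
```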
I'm not (yet) submitting a PR, as this is probably too much of a hack, but if this looks good, I'll gladly do so. @cansjt I do not see any blockers on the kubernetes library side; this appears to be an Airflow-related issue. |
@gkarg Read carefully. Where did I say there was a blocker? I asked how the Airflow team was planning to deal with this. I basically see two options:
If the Kubernetes client library was handling bookmark events properly, things would look different. What you are suggesting might be a good workaround for Airflow. I am not sure setting the resource version to 0 is okay. Is it equivalent to telling Airflow that we do not know the resource version and, somehow, Airflow will then retrieve the right value for it? I don't know Kubernetes' API well enough, I am sorry.

Nonetheless, you cannot call this an Airflow issue. The root cause is that the Python Kubernetes client library treats bookmark events not like valid events but like errors. I do not think that is a proper way to handle them. Nor do the kubernetes lib authors; this comment couldn't be any clearer. They did that to bypass a decoding error of the event payload when bookmark events were added to Kubernetes, but never got back to it to actually implement them. In particular, the bookmark event should provide you with the actual revision of the object, but because bookmarks are not implemented, you can't get access to that information. Which means, if my understanding is correct, you'll probably have to make additional API calls that you should not need to make; that's the whole point of those events.

Finding a workaround is not the same as treating the root of a problem. The fact that we might be able to work around it in Airflow does not mean the feature should/must not be implemented by the Kubernetes client library for Python. Also consider this: couldn't there be other reasons that would have the Kubernetes API answer with a 410 HTTP error? Maybe there are. And you said it yourself:
Maybe... Though, honestly, having to periodically tear down and redeploy Airflow is a bit of a hassle, and time I'd gladly spend on something else. So I'd be glad if the Airflow team would seriously consider your workaround, and thanks again for looking into this issue. |
@cansjt bookmark events being eaten by kubernetes-python is indeed a bug on their side, no arguments here. And I also thank you for digging into it and finding the root cause.
I did consider this, and you're likely right! :( However, that's how airflow handled this "traditionally".
I do agree that this is not a proper solution, just a workaround until something better comes along.
I feel your pain. BTW, been testing this patch, so far, so good.
I have indeed skimmed the text and did not read this issue carefully. I hadn't even noticed you're the issue author. I concur with your research and apologize for the tone. P.S. Just to be clear, the latest official airflow release, 2.2.4, pins the kubernetes library to 11.0.0, and thus cannot exhibit this behaviour. So people experiencing this bug have likely tinkered with their kubernetes library version (I know I did). So I can't even say this is a critical issue, yet. It will become critical once the next airflow version is released, as the pin was removed in a recent MR (I believe that change is slated for 2.3.0). |
That's good to know. Sadly, we had to choose: stick to 2.2.3 and be able to use many of the nice features SQLAlchemy 1.4 brings (with a few quirks, but we managed 😅) or upgrade (2.2.4 comes with a SQLAlchemy <1.4 constraint). We chose the former. But that's an entirely different issue 😁. |
Ah yeah. The "quirks" are the reason we put < 1.4 in. It's easy to handle the quirks when you are an individual user who is dedicated to handling them, but when you need to handle a flurry of issues from 1000s of users who expect it to "just work" - we chose < 1.4 :). But worry not - 2.3.0 is already >= 1.4 :). And soon(ish) it will be out. |
@potiuk Please keep in mind that, as soon as the next release ships without the kubernetes library pin,
this ticket will indeed receive a flurry of comments :) |
I hope just "GREAT! FANTASTIC! FINALLY, ❤️ 🚀 " kind of comments :) - or do you expect something else: 😱 or 🙀 ? |
hehe, I meant reports similar to the one in the OP/Issue description. But surely lots of hearts too :) |
In airflow 2.3.0, the watch can return events with revision numbers 1,2,3,99,4,5 and the executor will then try to watch (on the retry) from revision 5, which gives the "too old resource version" error. At least in my two EKS clusters it is easy to reproduce this scenario (see my post on stackoverflow). Below I start a watch that ends after 5 seconds and immediately do another watch with the latest resource version (just like airflow's kubernetes_executor.py does), and that always raises a 410.
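The original snippet is not preserved here; a minimal sketch of that experiment with the kubernetes python client might look like this (the namespace is a placeholder):

```python
# Reconstructed sketch of the experiment described above (the original
# snippet is not preserved in this thread). The namespace is a placeholder.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

# First watch: run for 5 seconds and remember the last resource version seen,
# the same way kubernetes_executor.py checkpoints it.
last_resource_version = None
w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, "airflow", timeout_seconds=5):
    last_resource_version = event["object"].metadata.resource_version

# Second watch: resume from that checkpoint, as the executor does on a retry.
# On the EKS clusters described above this raises ApiException(410),
# "too old resource version".
w = watch.Watch()
for event in w.stream(
    v1.list_namespaced_pod,
    "airflow",
    resource_version=last_resource_version,
    timeout_seconds=5,
):
    print(event["type"], event["object"].metadata.name)
```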
The error goes away as soon as the code is changed to keep track of the actual highest revision number seen. My PR #23504 tackles the aforementioned cause of "revision too old" in EKS (I guess there could be other scenarios that lead to it). |
Pay no attention to my last comment. I found in the Resource Version semantics documentation that resource versions must be treated as opaque: clients must not assume they are numeric or try to compare/order them.
So although it worked for me in EKS, the approach of tracking the "highest" revision number is not valid in general. |
My understanding is that: 1.) there is a bug in the kubernetes python client: it is not updating the watcher's resource_version from the BOOKMARK events. #1796 tries to fix it. 2.) BOOKMARK is not enabled by default; the client needs to request it by passing `allow_watch_bookmarks=True` to the list call here:
airflow/airflow/executors/kubernetes_executor.py Lines 137 to 140 in 8f181c1
3.) also, we will need to add a condition here to check the event's type and update the last_resource_version (a rough sketch follows below):
airflow/airflow/executors/kubernetes_executor.py Lines 145 to 149 in 8f181c1
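For illustration, a minimal sketch of a bookmark-aware watch loop (a hypothetical standalone function, not the actual executor code; it assumes the kubernetes client forwards BOOKMARK events, i.e. the #1796 fix, and the namespace and label selector are placeholders):

```python
# Hypothetical bookmark-aware watch loop, assuming the kubernetes python
# client forwards BOOKMARK events (i.e. the #1796 fix). This is not the
# actual Airflow code; namespace and label selector are placeholders.
from typing import Optional

from kubernetes import client, watch


def watch_with_bookmarks(
    v1: client.CoreV1Api,
    namespace: str,
    last_resource_version: Optional[str] = None,
) -> Optional[str]:
    watcher = watch.Watch()
    kwargs = {
        "label_selector": "airflow-worker",  # placeholder selector
        "allow_watch_bookmarks": True,       # ask the API server for BOOKMARK events
        "timeout_seconds": 60,
    }
    if last_resource_version:
        kwargs["resource_version"] = last_resource_version
    for event in watcher.stream(v1.list_namespaced_pod, namespace, **kwargs):
        if event["type"] == "BOOKMARK":
            # Bookmarks only carry metadata.resourceVersion; advance the
            # checkpoint without treating this as a pod change.
            last_resource_version = event["object"].metadata.resource_version
            continue
        # ... process ADDED / MODIFIED / DELETED events as the executor does ...
        last_resource_version = event["object"].metadata.resource_version
    return last_resource_version
```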
I think it would be nice to create a WIP pull request for it. |
adding some context: kubernetes doc on bookmark
|
@snjypl, the current code base will enter an infinite loop if the kubernetes python client raises an exception when we send the last known resource version. Using BOOKMARK will certainly allow airflow to track the last known resource version more accurately, but it doesn't rule out the possibility of a 410 GONE at all. I did test with `allow_watch_bookmarks=True`. I think there are two issues here (handling the 410 once it happens, and preventing it from happening in the first place). |
By the way, just to demonstrate that an already-expired resource version produces the 410 even with bookmarks enabled:
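(The original code was not preserved in this thread; this is a reconstructed sketch. `resource_version="1"` stands in for any long-expired checkpoint, and the namespace is a placeholder.)

```python
# Reconstructed sketch (the original code was not preserved in this thread).
# resource_version="1" deliberately simulates a checkpoint that expired long
# ago, e.g. after a network disconnect of more than 5 minutes. The namespace
# is a placeholder.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(
    v1.list_namespaced_pod,
    "default",
    allow_watch_bookmarks=True,  # bookmarks requested, but they cannot help here
    resource_version="1",        # artificially old resource version
    timeout_seconds=30,
):
    print(event["type"], event["object"].metadata.resource_version)
```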
This will produce an `ApiException` with status 410 ("too old resource version"), bookmarks or not.
So again, there is never a guarantee that the last resource version you got is going to be valid the next time you start the watch, no matter how well you track it, if you get disconnected for more than 5 minutes. |
@ecerulm
1.) Preventing the 410 error from happening: to prevent it we need to use the BOOKMARK feature; it was specifically introduced for this purpose. The prevention should be done by the kubernetes client watcher. It is because of this bug in the kubernetes python client that you were getting the 410 error even with bookmarks requested.
2.) Handling the error once it happens: with or without bookmarks, my view is that there is no need for airflow to keep track of the last_resource_version or even have the ResourceVersion singleton. It is not airflow's job to do the actual 'watching' of the resource. If the kubernetes client throws a 410 error for a particular resourceVersion, then even if you restart the watch process with that resourceVersion you are bound to get the same error.
I have opened a WIP pull request kubernetes-client/python#1796 for a better discussion of this issue. I went through your PR, and I can see that you are addressing the 'handle 410' part of the issue. |
This is a part that needs to be handled by the kubernetes client watch loop here. When you have BOOKMARK enabled, the api server will keep sending you the resource_version up to which it has processed events for your label selector; it does that roughly every 1 minute. |
The code you shared is not really a good way to demonstrate that.
Shouldn't it be the resource_version returned by the first watch, rather than a hard-coded old value?
The resource_version in the first loop and the one in the error message are the same. Again, it has nothing to do with bookmarks. |
@snjypl, let me see if I can address your comments (in no particular order, sorry).
`allow_watch_bookmarks=True` does not prevent the 410
The PR kubernetes-client/python#1796 you mention as "the bug" fix does NOT solve the 410 error if you start a watch with an already-expired resource version.
That's the point: to simulate an already expired resource version, like it would happen if there was a watch disconnect due to some network issue. If there is a 5 minute network disconnect between airflow and the k8s api before the next watch can actually be performed, then the resource_version will already be expired. In order to simulate that quickly I use an artificially old resource version. This was explicitly mentioned in the example comments:
It demonstrates that today, if you use a stale resource version, you get the 410 even with bookmarks enabled. This is true regardless of the client fix.
I don't think that is true: the resourceVersion can be too old by the time you get to use it. Two things to keep in mind:
I fully agree with that statement, but I guess that's a hefty change. But I do really think Airflow's kubernetes_executor.py is trying too hard to track resource_version and getting little to nothing by doing so. |
@ecerulm sorry, but that is not how watch works. If you start a watch with an old resourceVersion it WILL always give you a 410 error; that is the expected behaviour.
In short: if you get a 410, start a fresh watch. Don't bother with the most recent resourceVersion, because whatever resourceVersion your application is holding would still be old; you can't have a fresher resource version than the kube client watcher. Maybe looking at the unit tests for the bookmark events might help:
In this scenario, you need to start a new watch without a resourceVersion, or with resourceVersion="0" (see the sketch below). |
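For illustration, a rough sketch of that recovery pattern with the raw kubernetes client (not the Airflow code; the namespace is a placeholder):

```python
# Rough sketch (not the Airflow code): restart the watch from scratch when
# the stored resourceVersion has expired and the API server answers 410 Gone.
# The namespace is a placeholder.
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()

resource_version = None
while True:
    w = watch.Watch()
    try:
        for event in w.stream(
            v1.list_namespaced_pod,
            "default",
            resource_version=resource_version,
            timeout_seconds=60,
        ):
            resource_version = event["object"].metadata.resource_version
            print(event["type"], event["object"].metadata.name)
    except ApiException as e:
        if e.status == 410:
            # Checkpoint is too old: forget it and start a fresh watch.
            resource_version = None
            continue
        raise
```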
Airflow does not need to track the resource_version; the kube client watcher will do it. |
I guess we have a language barrier here, because I already knew that. The only thing needed to solve this particular issue #21087 is to handle 410 GONE in kubernetes_executor.py; there is no need for a "bug fix" in the kubernetes python client, and even with that bug fix we will still get 410s. So maybe we should stop talking about watch bookmarks when it comes to this particular issue. |
: ) it's been a long thread. Once that issue is fixed in the kube client, you won't be getting 410 errors when you do a watch, at least not so frequently. In the rare case of a 410 error, you NEED to start with resource_version="0"; the last_resource_version is not going to be of any help, like I mentioned in my previous comment. BOOKMARK is required for preventing the 410 error. |
So like I said, we will still see 410 errors. If we handle 410 in kubernetes_executor.py this issue will be solved regardless of kubernetes python client fixes, so those fixes are inconsequential for this issue.
Well, it's not that rare, and I don't think that supporting watch bookmarks is going to change radically how often we get them, because I think the main source of those 410s is that there is a disconnect between airflow and the kubernetes api and the resource_version is expired by the time it reconnects.
I thought we agreed that this is not preventing 410: you will still get it if enough time passes between the disconnect and when the next watch goes through, because at that time the resource version that kubernetes_executor.py is sending in the watch has already expired. Again, the bookmark thing is not preventing that at all. I think it's going to be very confusing for people reading this thread and thinking that some fix is required in the kubernetes python client, because that fix is not required at all to fix this issue; the only thing needed to fix this issue is to handle 410 properly. |
@ecerulm please go through the docs; it will help you understand why 410 occurs and how BOOKMARK will prevent it. 410 does not happen due to network or connectivity issues between airflow and the kube api server. It happens when no event of type ADDED, MODIFIED, or DELETED happens on the watched resource for a long time [~5 mins]. It would really help the conversation if you would take the time to go through the KEP and other docs. |
I have reproduced this locally with minikube and the kubernetes python client, so I can assure you it CAN happen due to network issues between airflow and the k8s api. Let me explain the setup: |
I've been running a watch with the kubernetes python client against a namespace where there were no new events at all for 1 hour, and I did not get an ApiException(410). So, are you sure about this? Have you ever seen it yourself in your kubernetes environment?
I did read all the documents, and I think I understand this ok; I have also actually done the testing and tried to back up what I say by doing it. I think you mean something else by "prevent". I hope the scenario I included in this comment will help you understand why 410 occurs in the event of network issues and why BOOKMARK can't prevent that. In principle BOOKMARK will help to get a better "last known resource version" at step 5, but by the time step 8 is reached that resource version won't be valid (if enough time has passed). And this is not theory; it's something that you can actually test and reproduce yourself, like I did. |
@ecerulm, that is not a valid scenario to test/demonstrate the bookmark feature. |
Well, but this is one of the scenarios that is causing the KubernetesJobWatcher to get a 410 and get stuck in an infinite loop. So again, what I'm trying to say is that bookmarks do not help to solve #21087 in this scenario. Just handling 410 properly solves #21087; there is no need for bookmarks at all to solve this github issue. |
@ecerulm, I believe when there is an error there are two parts to it: a.) handling the exception and recovering from it. What you are suggesting covers part (a): it will allow airflow to recover from the 410 and not get stuck in an infinite loop; we both agree on that. Now, using bookmarks etc. would come under part (b): trying to fix the root cause, or trying to prevent the 410 error. Part (b) requires a bug fix in the kubernetes-python client and also some changes in the airflow code. Both parts are equally important. Ultimately it is for the airflow maintainers and community members to decide what is best for them; I am just trying to help understand the root cause of the 410 and possible ways to prevent it from recurring. I think we should just leave it here, let others participate in the discussion, and hear their views. |
Like I said before, my only concern is that someone reading this thread may get the false impression that watch bookmarks are needed to stop the scheduler from getting stuck in an infinite loop, and some people may even get the idea that we need a new kubernetes python client before we can solve this, which is not the case. I just tried to explain that watch bookmarks are not needed to solve the infinite loop (which is what #21087 is about, IMHO), and that watch bookmarks alone will not prevent the 410, at least in the scenario that I'm personally experiencing (which I already explained).
One of the root causes would be the one that I explained, which I don't think can be prevented from recurring.
Well, I think a) is more important, and that's why I'm so adamant about making sure that a) is done and not delayed waiting for kubernetes client updates, etc., since just handling the 410 solves this issue. I totally agree to leave it here; I think with these last two posts it is clear what you mean, and what I mean. |
Apache Airflow version
2.2.3 (latest released)
What happened
After upgrading Airflow to 2.2.3 (from 2.2.2) and cncf.kubernetes provider to 3.0.1 (from 2.0.3) we started to see these errors in the logs:
Pods are created and run to completion, but it seems the KubernetesJobWatcher is incapable of seeing that they completed. From there Airflow goes to a complete halt.
What you expected to happen
No errors in the logs and the job watcher does its job of collecting completed jobs.
How to reproduce
I wish I knew. Trying to downgrade the cncf.kubernetes provider to previous versions to see if it helps.
Operating System
k8s (Airflow images are Debian based)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon 2.6.0
apache-airflow-providers-cncf-kubernetes 3.0.1
apache-airflow-providers-ftp 2.0.1
apache-airflow-providers-http 2.0.2
apache-airflow-providers-imap 2.1.0
apache-airflow-providers-postgres 2.4.0
apache-airflow-providers-sqlite 2.0.1
Deployment
Other
Deployment details
The deployment is on k8s v1.19.16, made with helm3.
Anything else
This, in its symptoms, looks a lot like #17629 but happens in a different place.
Redeploying as suggested in that issue seemed to help, but most jobs that were supposed to run last night got stuck again. All jobs use the same pod template, without any customization.
Are you willing to submit PR?
Code of Conduct