KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state #21087

cansjt · 2022-01-25T08:50:17Z

Apache Airflow version

2.2.3 (latest released)

What happened

After upgrading Airflow to 2.2.3 (from 2.2.2) and cncf.kubernetes provider to 3.0.1 (from 2.0.3) we started to see these errors in the logs:

{"asctime": "2022-01-25 08:19:39", "levelname": "ERROR", "process": 565811, "name": "airflow.executors.kubernetes_executor.KubernetesJobWatcher", "funcName": "run", "lineno": 111, "message": "Unknown error in KubernetesJobWatcher. Failing", "exc_info": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 102, in run\n    self.resource_version = self._run(\n  File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 145, in _run\n    for event in list_worker_pods():\n  File \"/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py\", line 182, in stream\n    raise client.rest.ApiException(\nkubernetes.client.exceptions.ApiException: (410)\nReason: Expired: too old resource version: 655595751 (655818065)\n"}
Process KubernetesJobWatcher-6571:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
    self.resource_version = self._run(
  File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 655595751 (655818065)

Pods are created and run to completion, but it seems the KubernetesJobWatcher is incapable of seeing that they completed. From there Airflow goes to a complete halt.

What you expected to happen

No errors in the logs and the job watcher does it's job of collecting completed jobs.

How to reproduce

I wish I knew. Trying to downgrade the cncf.kubernetes provider to previous versions to see if it helps.

Operating System

k8s (Airflow images are Debian based)

Versions of Apache Airflow Providers

apache-airflow-providers-amazon 2.6.0
apache-airflow-providers-cncf-kubernetes 3.0.1
apache-airflow-providers-ftp 2.0.1
apache-airflow-providers-http 2.0.2
apache-airflow-providers-imap 2.1.0
apache-airflow-providers-postgres 2.4.0
apache-airflow-providers-sqlite 2.0.1

Deployment

Other

Deployment details

The deployment is on k8s v1.19.16, made with helm3.

Anything else

This, in the symptoms, look a lot like #17629 but happens in a different place.
Redeploying as suggested in that issues seemed to help, but most jobs that were supposed to run last night got stuck again. All jobs use the same pod template, without any customization.

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

cansjt · 2022-01-28T17:03:41Z

After downgrading to 2.1.0, logs showed another error (but reported with the log level INFO!)

{"asctime": "2022-01-25 17:24:18",
 "levelname": "INFO",
 "process": 1,
 "name": "airflow.executors.kubernetes_executor.KubernetesExecutor",
 "funcName": "_adopt_completed_pods",
 "lineno": 740,
 "message": "Failed to adopt pod <removed>. Reason: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '<removed>', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 25 Jan 2022 17:24:18 GMT', 'Content-Length': '414'})
HTTP response body: {
  \"kind\":\"Status\",
  \"apiVersion\":\"v1\",
  \"metadata\":{},
  \"status\":\"Failure\",
  \"message\":\"pods \\\"<removed>\\\" is forbidden: User \\\"system:serviceaccount:<removed>\\\" cannot patch resource \\\"pods\\\" in API group \\\"\\\" in the namespace \\\"<removed>\\\"\",
  \"reason\":\"Forbidden\",
  \"details\":{\"name\":\"<removed>\",
                   \"kind\":\"pods\"},
  \"code\":403}

"}

Giving the missing permission to the scheduler's service account fixed the issue. Still need to upgrade the cncf.kubernetes provider again to be sure this was the root cause of the reported error or not.

potiuk · 2022-02-06T13:27:56Z

I think this is also similar root cause as #12644 . @dimberman @jedcunningham @kaxil - or maybe somoene else who has some more experiences with K8S deployments in "real life" - this error "Resource too old" is returned by K8S when there are too many changes to a version of K8S resource.

But I am just wondering - it really happens IMHO because we deploy some changes "incrementally" too frequently (and too many times) in the chart/deployment ? Or maybe because we do NOT deploy the "full" deployment where we should?

I am not too experienced in long running K8S deployments, but for me it looks like something that this could be solved by identifiying which resources those are and implement some full "re-deployment" from time to time.

It might be, that this is outside of our control as well, but I've seen some other people complaining about that recently so maybe we could have someone who has more insights there to take a look ?

jedcunningham · 2022-02-10T18:46:54Z

It's not that there are too many changes, at least like you are thinking. The executor is trying to watch from (history) revision n, which has rolled off of history on the k8s side. n+2 might be the oldest available now.

This really isn't related to our deployments, and there isn't anything we can do in that regard to help here. In fact, you could hit this when running the scheduler outside k8s even.

More context here: #15500 (comment)

Bottom line, especially now that we aren't pinned to kubernetes==11, we can probably handle this more gracefully now on our side.

potiuk · 2022-02-10T19:33:09Z

More context here: #15500 (comment)

Bottom line, especially now that we aren't pinned to kubernetes==11, we can probably handle this more gracefully now on our side.

Thanks for the context - now I understand where it comes from! Yeah it isn't an easy one to handle!

arkadiusz-bach · 2022-02-28T01:09:53Z

I have the same issue.

Looks like it is happening, because they are now(don't know from which version) handling 410 error on kubernetes library side, there is one retry and then exception is raised if the event is of type = 'ERROR'

I checked the kubernetes library and it was changed in this pull request:
https://github.com/kubernetes-client/python-base/pull/133/files

On Airflow Kubernetes Executor it is being handled here:

airflow/airflow/executors/kubernetes_executor.py

Line 148 in d726579

if event['type'] == 'ERROR':

By process_error function, but it probably should be now enclosed by try catch ApiException with check for 410 code

cansjt · 2022-02-28T08:19:20Z

To me, per the tickets referenced in #15500, the problem seem more in the Kubernetes Python's API. It does not handle the kind of messages we'd need for Airflow to recover when it's lost track of resources versions.

Last I check there was no bug report there regarding that. So I did open one. I am sadly not the more knowledgeable to provide insight on how to best resolve this. @jedcunningham Maybe you can help?

Until this is resolved on the k8s client side, what is the plan for Airflow? Wait? Resurrect the #15500 PR? (not sure if it fully solves the issue)

The current deployable check does not allow to redeploy the same deployment once again. Meaning if your deployment gets corrupted somehow you cannot delete it and deploy it again. As an example, Airflow (jobflow) currently as a bug due to missing support for certain kubernetes resources which can have it looses track of some of the resources it created in k8s. The only work around seem to delete and create again the deployment (_cf._ apache/airflow#21087).

gkarg · 2022-04-04T18:20:57Z

I have the same issue.

Looks like it is happening, because they are now(don't know from which version) handling 410 error on kubernetes library side, there is one retry and then exception is raised if the event is of type = 'ERROR'

I checked the kubernetes library and it was changed in this pull request: https://github.com/kubernetes-client/python-base/pull/133/files

On Airflow Kubernetes Executor it is being handled here:

airflow/airflow/executors/kubernetes_executor.py

Line 148 in d726579

if event['type'] == 'ERROR':

By process_error function, but it probably should be now enclosed by try catch ApiException with check for 410 code

I concur with @arkadiusz-bach 's analysis.

The error is simply happening because newer versions of kubernetes lib throw exceptions which airflow fails to handle.

Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
File "/opt/conda/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 103, in run
    kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
File "/opt/conda/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
File "/opt/conda/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 183, in stream
    status=obj['code'], reason=reason)
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 891097969 (891307823)

The culprit is here for event in list_worker_pods():, thus _process_error never gets a chance to run.

So, a simple fix would be to do something like this:

airflow/airflow/executors/kubernetes_executor.py

Line 95 in 2cf1ae3

def run(self) -> None:

    def run(self) -> None:
        """Performs watching"""
        kube_client: client.CoreV1Api = get_kube_client()
        if not self.scheduler_job_id:
            raise AirflowException(NOT_STARTED_MESSAGE)
        while True:
            try:
                self.resource_version = self._run(
                    kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
                )
            except ReadTimeoutError:
                self.log.warning(
                    "There was a timeout error accessing the Kube API. Retrying request.", exc_info=True
                )
                time.sleep(1)
+            except ApiException as e:
+                if e.status == 410:
+                    self.resource_version = "0"
+                else:
+                    raise
            except Exception:
                self.log.exception('Unknown error in KubernetesJobWatcher. Failing')
                raise
            else:
                self.log.warning(
                    'Watch died gracefully, starting back up with: last resource_version: %s',
                    self.resource_version,
                )

I'm not (yet) submitting a PR, as this is probably too much of a hack, but if this looks good, I'll gladly do.

@cansjt I do not see any blockers from lib-kubernets side, this appears to be an airflow-related issue.

cansjt · 2022-04-05T06:09:33Z

@gkarg Read carefully. Where did I say there was a blocker? I asked how the Airflow team was planning to deal with this? I basically see two options:

Accept a work around, like the one you suggest (thanks for that);
Or wait for the Kubernetes client library to implement proper support for bookmark events;

If the Kubernetes client library was handling bookmark events properly, the _run() method should/could simply output the updated resource version, instead of raising an exception.

What you are suggesting might be a good work around for Airflow. I am not sure setting the resource version to 0 is okay. Is it equivalent to say to Airflow that we do not know the resource version and, somehow, Airflow will then retrieve the right value for it? I don't know Kubernetes' API well enough, I am sorry.

Nonetheless, you cannot call this an Airflow issue. The root cause is the Python Kubernetes client library treats bookmark events not like valid events but like errors. I do not think that is a proper way. Nor do the kubernetes lib authors, this comment couldn't be any clearer. They did that to bypass a decoding error of the event payload, when bookmark events were added to Kubernetes. But never got back to it, to actually implement it. In particular, the bookmark event, should provide you with the actual revision of the object. But because they are not implemented, you can't have access to that information. Which means, if my understanding is correct, you'll probably have to make additional API calls, you should not need to make, that's the point of those events.

Finding a work around is not the same as treating the root of a problem. The fact that we might be able to work around it, in Airflow, does not mean the feature should/must not be implemented by the Kubernetes client library for Python.

Also consider this: couldn't there be other reasons that would have Kubernetes' API answers with a 410 HTTP Error? Maybe the _run() method only makes one single kubernetes API call, making the answer unambiguous, maybe it doesn't (I don't know, nor did I look for the answer).

And you said it yourself:

this is probably too much of a hack

Maybe... Though, honestly, having to periodically tear down and deploy again Airflow, is a bit of a hassle. Time I'd gladly spend on something else. So I'd be glad if the Airflow team would seriously consider your work around and thanks again for looking into this issue.

gkarg · 2022-04-05T06:30:16Z

@cansjt bookmark events being eaten by kubernetes-python is indeed a bug on their side, no arguments here. And I also thank you for digging into it and finding the root cause.

Also consider this: couldn't there be other reasons that would have Kubernetes' API answers with a 410 HTTP Error?

I did consider this, and you're likely right! :( However, that's how airflow handled this "traditionally"

airflow/airflow/executors/kubernetes_executor.py

Line 178 in 2cf1ae3

if raw_object['code'] == 410:

so it's not a new behaviour, just restoration of old (arguably hacky) behaviour.

I do agree that this is not a proper solution, just a workaround until something better comes along.

Though, honestly, having to periodically tear down and deploy again Airflow, is a bit of a hassle

I feel your pain. BTW, been testing this patch, so far, so good.

Read carefully

I have indeed skimmed the text and did not read this issue carefully. Haven't even noticed you're the issue author. I concur with your research and apologize for the tone.

P.S. Just to be clear, the latest official airflow release 2.2.4 pins kubernetes library to 11.0.0, and thus can not exhibit this behaviour. So people experiencing this bug have likely tinkered with their kubernetes library version (I know I did). So I can't even say this is a critical issue, yet. It will become critical once next airflow version is released, as the pin was removed in a recent MR (I believe that change is slated for 2.3.0).

cansjt · 2022-04-05T06:50:46Z

the latest official airflow release 2.2.4 pins kubernetes library to 11.0.0

That's good to know. Sadly, we had to choose: stick to 2.2.3 and be able to use many of the nice features SQLAlchemy 1.4 brings (with a few quirks, but we managed 😅) or upgrade (2.2.4 comes with a SQLAlchemy <1.4 constraint). We chose the former. But that's an entirely different issue 😁.

potiuk · 2022-04-05T09:32:35Z

That's good to know. Sadly, we had to choose: stick to 2.2.3 and be able to use many of the nice features SQLAlchemy 1.4 brings (with a few quirks, but we managed 😅) or upgrade (2.2.4 comes with a #21235). We chose the former. But that's an entirely different issue 😁.

Ah yeah. The "quirks" are the reason we put < 1.4 in. It's easy to handle the quirks when you are an indivdual user who is dedicated to handle it, but when you need to handle a flurrry of issues from 1000s of user who expect it to "just work" - we chose the < 1.4 :) .

But worry not - 2.3.0 is already >= 1.4 :). And soon(ish) it will be out.

gkarg · 2022-04-05T09:37:53Z

That's good to know. Sadly, we had to choose: stick to 2.2.3 and be able to use many of the nice features SQLAlchemy 1.4 brings (with a few quirks, but we managed 😅) or upgrade (2.2.4 comes with a #21235). We chose the former. But that's an entirely different issue 😁.

Ah yeah. The "quirks" are the reason we put < 1.4 in. It's easy to handle the quirks when you are an indivdual user who is dedicated to handle it, but when you need to handle a flurrry of issues from 1000s of user who expect it to "just work" - we chose the < 1.4 :) .

But worry not - 2.3.0 is already >= 1.4 :). And soon(ish) it will be out.

@potiuk Please keep in mind, that, as soon as

2.3.0 is out - with kubernetes-python pin lifted
and people start upgrading their kubernetes-python (there are very good reasons to)

this ticket will indeed receive a flurry of comments :)

potiuk · 2022-04-05T10:48:17Z

this ticket will indeed receive a flurry of comments :)

I hope just "GREAT! FANTASTIC! FINALLY, ❤️ 🚀 " kind of comments :) - or do you expect something else: 😱 or 🙀 ?

gkarg · 2022-04-05T10:57:25Z

this ticket will indeed receive a flurry of comments :)

I hope just "GREAT! FANTASTIC! FINALLY, ❤️ 🚀 " kind of comments :) - or do you expect something else: 😱 or 🙀 ?

hehe, I meant reports similar to the one in the OP/Issue description. But surely lots of hearts too :)

ecerulm · 2022-05-05T21:54:53Z

#15500 (comment)
The executor is trying to watch from (history) revision n, which has rolled off of history on the k8s side. n+2 might be the oldest available now.

In airflow 2.3.0 (kubernetes==23.3.0) the executor tries to watch from revision n where n is the last received revision , at least in my testing in EKS the last received revision is not necessarily the highest /most recent revision number.

I mean the watch can return events with revision numbers 1,2,3,99,4,5 and the executor will try to watch (on the retry) from revision 5 which gives resource too old. It should try to watch from revision 99.

At least in my two EKS cluster is easy to reproduce this scenario (see my post on stackoverflow). Below I start a watch that end after 5 second and immediately do another watch with the latest resource version (just like airflow kubernetes_executor.py does) and that always raises a resource too old for me. I guess in EKS you can't really ask for any other revision than 0 or the actual latest.

# python3 -m venv venv
# source venv/bin/activate
# pip install 'kubernetes==23.3.0'

from kubernetes import client,config,watch

config.load_kube_config(context='my-eks-context')

v1 = client.CoreV1Api()
watcher = watch.Watch()


namespace = 'kube-system'
last_resource_version=0

# this watch will timeout in 5s to have a fast way to simulate a watch that need to be retried 
for i in  watcher.stream(v1.list_namespaced_pod, namespace, resource_version=last_resource_version, timeout_seconds=5):
   print(i['object'].metadata.resource_version)
   last_resource_version = i['object'].metadata.resource_version


# we retry the watch starting from the last resource version known
# but this ALWAYS raises ApiException: (410) Reason: Expired: too old resource version: 379140622 (380367990) for me
for i in  watcher.stream(v1.list_namespaced_pod, namespace, resource_version=last_resource_version, timeout_seconds=5):
    print('second loop', i['object'].metadata.resource_version)
    last_resource_version = i['object'].metadata.resource_version

as soon as changed to keep track of the actual highest revision number with last_resource_version = max(latest_resource_version,i['object'].metadata.resource_version) I stop getting those resource too old

My PR #23504 tackles the aforementioned cause of "revision too old" in EKS (I guess there could be other scenarios that lead to resource too old)

ecerulm · 2022-05-06T06:53:59Z

Pay no attention to my last comment.

I found in Resource Version semantics that

You must not assume resource versions are numeric or collatable. API clients may only compare two resource versions for equality (this means that you ***must not compare resource versions for greater-than ** or less-than relationships).

So although it worked for me in EKS the max(...) is not legal.

snjypl · 2022-05-08T21:43:28Z

my understanding is that:

1.) there is a bug with the kubernetes python client, it is not updating the watcher resource_version from the BOOKMARK events. #1796 tries to fix it.

2.) BOOKMARK is not enabled by default, the client needs to request it by sending allow_watch_bookmarks=True in the request. by default it is false. currently KubernetesJobWatcher is not sending it. so it is not using the BOOKMARK feature.

airflow/airflow/executors/kubernetes_executor.py

Line 129 in 8f181c1

kwargs = {'label_selector': f'airflow-worker={scheduler_job_id}'}

airflow/airflow/executors/kubernetes_executor.py

Lines 137 to 140 in 8f181c1

    
           if self.multi_namespace_mode: 
        
               list_worker_pods = functools.partial( 
        
                   watcher.stream, kube_client.list_pod_for_all_namespaces, **kwargs 
        
               )

3.) also we will need to add a condition here, to check the event's type and update the last_resource_version

airflow/airflow/executors/kubernetes_executor.py

Lines 145 to 149 in 8f181c1

    
           for event in list_worker_pods(): 
        
               task = event['object'] 
        
               self.log.info('Event: %s had an event of type %s', task.metadata.name, event['type']) 
        
               if event['type'] == 'ERROR': 
        
                   return self.process_error(event)

i think, it will be nice to create a WIP pull request for it.

snjypl · 2022-05-08T21:57:49Z

adding some context: kubernetes doc on bookmark

A given Kubernetes server will only preserve a historical record of changes for a limited time. Clusters using etcd 3 preserve changes in the last 5 minutes by default.

To mitigate the impact of short history window, the Kubernetes API provides a watch event named BOOKMARK. It is a special kind of event to mark that all changes up to a given resourceVersion the client is requesting have already been sent.

As a client, you can request BOOKMARK events by setting the allowWatchBookmarks=true query parameter to a watch request,

ecerulm · 2022-05-09T11:45:34Z

@snjypl, allow_watch_bookmarks does not prevent a "410 GONE", and does not prevent the current airflow code path to enter an infinite loop because of it.

The current code base will enter an infinite loop if kubernetes python client returns an Exception when sending the last known resource version.

Using BOOKMARK will certainly allow airflow to track the last known resource version more accurately, but it doesn't rule out the possibility of a 410 GONE at all. I did test with kubernetes==23.3.0 and you still get a 410 when using allow_watch_bookmarks=True if the resource_version is expired (which can happen if there is let's say a 10 minutes network disconnect)

I think there are two issues here

watch with bookmarks, which is good to have and allow better tracking of the last resource_version but does not prevent 410 gone.
this issue which is about airflow entering an unrecoverable infinite loop if watch ever returns a 410 GONE. This is solved on Prevent KubernetesJobWatcher getting stuck on resource too old #23521.

By the way , just to demonstrate that allow_watch_bookmarks=True does not prevent a 410 GONE / ApiException here is some snippet that I run to check it

rom kubernetes import client,config,watch
from kubernetes.client.exceptions import ApiException

config.load_kube_config(context='eks-prod')
namespace = 'my-namespace'

v1 = client.CoreV1Api()
watcher = watch.Watch()

print('First get/list/watch it will disconnect in 2 seconds')
last_resource_version = None

#  https://raw.githubusercontent.com/kubernetes-client/python/master/kubernetes/docs/CoreV1Api.md
for i in  watcher.stream(v1.list_namespaced_pod, namespace=namespace, allow_watch_bookmarks=True, timeout_seconds=2):
   resource_version = i['object'].metadata.resource_version
   print('first loop', resource_version)
   last_resource_version = resource_version

print('The watch disconnected, the last known resource version was', last_resource_version)
# assume that last_resource_version has already expired by the time we are able to reconnect to kubernetes api 
# to simulate that scenario  I send resource_version=1 which is for sure expired

print('Try to start a new watch with allow_watch_bookmarks=True and resource_version=1')
for i in  watcher.stream(v1.list_namespaced_pod, namespace, resource_version=1, allow_watch_bookmarks=True,timeout_seconds=2):
   resource_version = i['object'].metadata.resource_version
   print('second loop', resource_version)
   last_resource_version = resource_version

This will produce

....
first loop 380775416
The watch disconnected, the last known resource version was 380775416
Traceback (most recent call last):
  File "main6.py", line 23, in <module>
    for i in  watcher.stream(v1.list_namespaced_pod, namespace, resource_version=1, allow_watch_bookmarks=True,timeout_seconds=2):
  File "/Users/rubelagu/git/python-kubernetes-client/venv/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)

So again there is never a guarantee that the last resource version that you got is going to be valid the next time that you start the watch, no matter how well you track it, if you get disconnected for more than 5 minutes.

snjypl · 2022-05-09T12:44:58Z

@ecerulm
as i understand there are two aspects to this issue:

1.) preventing 410 error from happening:

To prevent it we need to use the BOOKMARK feature. it was specifically introduced for this purpose. the prevention should be done by kubernetes python client in the stream loop here
now because of a bug in the kubernetes python client, the BOOKMARK events were not used correctly. the following pull requests seeks to fix it. kubernetes-client/python#1796

it is because of this bug in kubernetes python client that you were getting 410 error even with allow_watch_bookmark enabled.

2.) handling the error once it happens.

now with or without allow_watch_bookmarks enabled , in case 410 error occurs the client is expected to restart the watch with resource_version ="0" or None [ None is preferred unless you have a need to trace from the beginning]
i understand your pull request seeks to correctly address the handling 410 part of the issue.

my view is, there is no need airflow to keep track of the last_resource_version or even have ResoureVersion singleton. it is not airflow job to do the actual 'watching' of resource.

if Kubernetes client throws 410 error for a particular resourceVersion, then even if you restart the watch process with that resourceVersion you are bound to get the same error.

i have opened a WIP pull request kubernetes-client/python#1796 for a better discussion of this issue.

@ecerulm

i went through your PR, and i can see that you are addressing the 'handle 410' part of the issue.
my suggestion would be that, we can catch the ApiException and call _run with resource_version="0" if e.status == 410.
@cansjt has raised the concern about relying on the status for this particular error, but i believe it is safe, 410 is raised only for this particular error.

snjypl · 2022-05-09T12:54:09Z

So again there is never a guarantee that the last resource version that you got is going to be valid the next time that you start the watch, no matter how well you track it, if you get disconnected for more than 5 minutes.
i will try to explain it a bit:

this is a part that needs to be handle by the kubernetes client watch loop here.

when you have BOOKMARK enabled, the api server will keep sending you the resource_version till which it has processed for your label. it does that roughly every 1 minutes.
because of the bookmark event you will always have a resourceVersion which is not old and is in the etcd cache.

snjypl · 2022-05-09T13:08:07Z

the code you shared is not really a good way to demonstrate BOOKMARK.
but still:

for i in watcher.stream(v1.list_namespaced_pod, namespace, resource_version=1, allow_watch_bookmarks=True,timeout_seconds=2):
resource_version = i['object'].metadata.resource_version

shouldn't it be resource_version = last_resource_version. resource_version = 1 will mostly always fail.

first loop 380775416
The watch disconnected, the last known resource version was 380775416

the resource_version in the first loop and the error message is same. again, it has nothing to do with bookmark.

ecerulm · 2022-05-09T14:19:02Z

@snjypl , let me see if I can address you comments (in no particular order, sorry)

allow_watch_bookmark=True does not prevent 410

it is because of this bug in kubernetes python client that you were getting 410 error even with allow_watch_bookmark enabled.

The PR kubernetes-client/python#1796 you mention as "the bug" does NOT solve the 410 error, if you do allow_watch_bookmark and resource_version=someexpiredresourceversion I still get a 410 GONE using your PR code or am I missing something?:

 git clone [email protected]:snjypl/kubernetes-client-python.git
cd kubernetes-client-python
git checkout  bugfix/1729-bookmarkevents
pip install -e . 
# Successfully installed kubernetes-23.0.1-snapshot
cd ..
python main6.py # the code that I shared earlier
Try to start a new watch with allow_watch_bookmarks=True and resource_version=1
Traceback (most recent call last):
  File "main6.py", line 24, in <module>
    for i in  watcher.stream(v1.list_namespaced_pod, namespace, resource_version=1, allow_watch_bookmarks=True,timeout_seconds=2):
  File "/Users/rubelagu/git/python-kubernetes-client/kubernetes-client-python/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 1 (384756203)

shouldn't it be resource_version = last_resource_version. resource_version = 1 will mostly always fail.

thats the point to simulate an already expired resource version, like it would happen if there was a watch disconnect due to some network issue, there was 5 minutes network disconnect from airflow to k8s api, until the next watch could actually be performed, then the resource_version will be already expired. In order to simulate that quickly I use resource_version=1

This was explicitly mentioned in the example comments :

# assume that last_resource_version has already expired by the time we are able to reconnect to kubernetes api 
# to simulate that scenario  I send resource_version=1 which is for sure expired

the code you shared is not really a good way to demonstrate BOOKMARK.

It demonstrates that today if you use a allow_watch_bookmarks=True and resource_version=some_already_expired_resource_version you will still get a 410 GONE.

This is true in kubernetes==20.3.0 and also true in your PR kubernetes-client/python#1796 because I tested with your PR code

because of the bookmark event you will always have a resourceVersion which is not old and is in the etcd cache.

I don't think that is true, the resourceVersion can be too old by the time you get to use it. Two things to keep in mind:

As per kubernetes documentation watch events:

you shouldn't assume bookmarks are returned at any specific interval, nor can clients assume that the API server will send any BOOKMARK event even when requested.

It's possible that some amount of time happens between the disconnect and the next successful watch connection. So you can get a perfection accurate resource_version X at the time of disconnection, and when the network connectivity is restored then that resource_version X is already expired.

my view is, there is no need airflow to keep track of the last_resource_version or even have ResourceVersion singleton. it is not airflow job to do the actual 'watching' of resource.

I fully agree with that statement but I guess that's a hefty change. But I do really thing Airflow kubernetes_executor.py is trying too hard to track resource_version and getting little to nothing by doing so.

snjypl · 2022-05-09T15:46:03Z

@ecerulm sorry, but it is not how watch works.

if you start a watch with an old resourceVersion it WILL always give you 410 error. that is the expected behaviour.

When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a new get or list operation, and starting the watch from the resourceVersion that was returned.

in short: if you get 410, start a fresh watch. don't bother about the recent resoureVersion, because whatever maybe the resourceVersion that your application is having it would still be old. you can't have a fresher resource version than the kube client watcher.

maybe looking at the unittest for the bookmark events might help:

https://github.com/kubernetes-client/python/blob/f379bbabf33f239f49b0e1b7652b6212f3960910/kubernetes/base/watch/watch_test.py#L170-L204

thats the point to simulate an already expired resource version, like it would happen if there was a watch disconnect due to some network issue, there was 5 minutes network disconnect from airflow to k8s api, until the next watch could actually be performed, then the resource_version will be already expired. In order to simulate that quickly I use resource_version=1

In this scenario, you need to start a new watch without resourceVersion or resoureVersion="0".
when you pass any other resourceVersion, you can asking to start from that specify version and it does not exist in the case you will get 410 error.

snjypl · 2022-05-09T15:51:30Z

I fully agree with that statement but I guess that's a hefty change. But I do really thing Airflow kubernetes_executor.py is trying too hard to track resource_version and getting little to nothing by doing so.

airflow does not need to track the resoure_version. kube client watcher will do it.

ecerulm · 2022-05-09T16:22:05Z

sorry, if you start a watch with an old resourceVersion it WILL always give you 410 error. that is the expected behaviour.

I guess we have a language barrier here, because I already knew that
and that's what I've been trying to explain to you all along. I guess I misunderstood what you mean by it is because of this bug in kubernetes python client that you were getting 410 error even with allow_watch_bookmark enabled.

The only thing needed to solve his particular issue #21087 is to handle 410 GONE in kubernetes_executor.py , there is no need for a "bug fix" in kubernetes python client and even with that bug fix we will still get 410. So maybe we should stop talking about watch bookmarks when it comes to this particular issue.

snjypl · 2022-05-09T16:43:06Z

: ) it's been a long thread.

once that issue is fixed in kube client, you won't be getting 410 errors , at least not so frequently, when you do a watch. in the rare case of a 410 error. you NEED to start with resource_version="0". the last_resource_version is not going to be any help.

like i mentioned in my previous comment. BOOKMARK is required for preventing 410 error.

snjypl · 2022-05-09T16:46:45Z

for ref: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/956-watch-bookmark

ecerulm · 2022-05-09T16:53:42Z

once that issue is fixed you kube client, you won't be getting 410 errors , at least not so frequently, when you do a watch.

so like I said, we will still see 410 errors. If we handle 410 in kubernetes_executor.py this issue will be solved regardless of kubernetes python client fixes. So those fixes are inconsequential for this issue

in the rare case of a 410 error.

well, it's not that rare and I don't think that supporting watch bookmarks is going to change radically how often we get them, because I think the main source of those 410 is that there is a disconnect between airflow and kuberenetes api and the resource_version will be expired by the time it reconnects.

like i mentioned in my previous comment. BOOKMARK is required for preventing 410 error.

I thought that we agreed that this is not preventing 410, you will still get it if enough time passed between the disconnect and the when the next watch is goes through because at that time the resource version that kuberentes_executor.py is sending in the watch has already expired. Again, the bookmark thing is not preventing that at all.

I think it's going to be very confusing for people reading this thread and thinking that some fix is required in kubernetes python client , because that fix is not required at all to fix this issue, the only thing needed to fix this issue is to handle 410 properly.

snjypl · 2022-05-09T17:14:28Z

well, it's not that rare and I don't think that supporting watch bookmarks is going to change radically how often we get them, because I think the main source of those 410 is that there is a disconnect between airflow and kuberenetes api and the resource_version will be expired by the time it reconnects.

@ecerulm please go through the docs. it will help you understand why 410 occurs and how BOOKMARK will prevent it.

410 does not happen due to network issues or connectivity issue between airflow and kubeapi server.

it happens when no event of type ADDED, MODIFIED, DELETED happens on the watched resource for a long time [ ~ 5 mins].

it would really help the conversation if you would take the time to go through the KEP and other docs.

ecerulm · 2022-05-09T20:40:00Z

410 does not happen due to network issues or connectivity issue between airflow and kubeapi server.

I have reproduced this locally with minikube and kubernetes python client so I can assure you it CAN happen due to network issues between airflow and the k8s api. Let me explain the setup

Minikube locally
Python script that performs a watch in a while loop just like airflow does
Python script connects to minikube via toxiproxy so that I can simulate a network disconnection (context: minikube2)
As soon as I simulate the disconnection, the watch will exit with an exception ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
The python script remembers the last resource version just like airflow does
the python script retries continuously the watch with the last known resource_version=5659, failing each time with HTTPSConnectionPool(host='127.0.0.1', port=2222): Max retries exceeded with url: /api/v1/namespaces/testns2/pods?resourceVersion=5659&watch=True (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x107b02c40>: Failed to establish a new connection: [Errno 61] Connection refused'))
In another window, I create/delete deployments in that namespace (via context minikube, which does not go through the toxiproxy). I do this to create new events.
After 6 minutes, I reenabled the toxiproxy
The python script retries the watch with resource_version=5659, it connects and I get a (410) Reason: Expired: too old resource version: 5659 (6859)

it happens when no event of type ADDED, MODIFIED, DELETED happens on the watched resource for a long time [ ~ 5 mins]

I've been running a watch with the kubernetes python client to a namespace where there is not new events at all for 1 hours and I did not get an ApiException(410). So, are you sure of this? Have you ever seen this yourself in your kubernetes environment?

@ecerulm please go through the docs. it will help you understand why 410 occurs and how BOOKMARK will prevent it.
it would really help the conversation if you would take the time to go through the KEP and other docs.

I did read all the documents, and I think I understand this ok, also I have actually done testing and try to actually back up what I say by doing it.

I think you mean something else by "prevent".

I hope the scenario I included in this comment will help you understand why 410 occurs in the event of network issues and how BOOKMARK can't prevent that. In principle BOOKMARK will help to get a better "last known resource version " at step 5 but by the time step 8 is reached that resource version won't be valid (if enough time has passed). And this is not theory it's something that you actually do test and reproduce yourself like I did.

snjypl · 2022-05-10T11:17:25Z

@ecerulm, that is not a valid scenario to test/demostrate bookmark feature.
the bookmark does not work the way you understand it. i wish i could give you a better example.

ecerulm · 2022-05-10T11:53:40Z

Well, but this is one of the scenarios that is causing KubernetesJobWatcher to get 410 and get stuck in an infinite loop. So again what I'm trying to say is that bookmarks do not help to solve #21087 is this scenario. Just handling 410 properly solves #21087 no need for bookmark at all to solve this github issue.

snjypl · 2022-05-10T12:35:52Z

@ecerulm , i believe when there is an error there are two parts to it

a.) handling the exception and recovering from it.
b.) finding the cause and trying to fix it so that it won't happen again or atleast reduce the frequency of the error.

what you are suggesting covers part (a). it will allow airflow to recover from 410 and not get stuck in an infinite loop. we both agree on that.

now, using bookmark etc would come in part (b) trying to fix the root cause or try to prevent the 410 error. part (b) requires a bug fix in kubernetes-python client and also some changes in airflow code.

both parts are equally important.

ultimately it is for the airflow maintainers and community members to decide what is best for them. i am just trying to help understand the root cause of 410 and possible ways prevent it from recurring.

i think, we should just leave it here and let others participate in the discussion, and hear their views.

ecerulm · 2022-05-10T13:34:06Z

@snjypl

Like I said before, my only concern is that someone that reads this thread maybe get the false impression that watch bookmarks are needed to stop the scheduler from getting stuck in an infinite loop and some people may even get the idea that we need a new kubernets python client before we can solve this, which is not the case.

I just tried to explain that watch bookmarks are not needed to solve the infinite loop (which is what this #21087 is about IMHO) . And that watch bookmarks alone will not prevent the 410 at least the scenario that I'm personally experiencing (which I already explained).

i am just trying to help understand the root cause of 410 and possible ways prevent it from recurring.

One of the root causes would be the one that I explained, which I don't think can't be prevented from recurring.

both parts are equally important.

well, I think a) is more important and that's why I'm so adamant into making sure that a) is done and not delayed waiting for kubernetes client updates, etc. Since just b) ...reduce the frequency of the error. does not make a big difference, if my scheduler gets into infinite loop every hour or every day it does not matter that much. Both cases are unacceptable for me.

I totally agree to leave it here, I think with these two last post is clear what you mean, and what I mean.

cansjt added area:core kind:bug This is a clearly a bug labels Jan 25, 2022

eladkal added provider:cncf-kubernetes Kubernetes provider related issues affected_version:2.2 Issues Reported for 2.2 labels Feb 28, 2022

cansjt mentioned this issue Feb 28, 2022

Add support for bookmark events kubernetes-client/python-base#286

Closed

cansjt mentioned this issue Feb 28, 2022

Add support for bookmark events kubernetes-client/python#1729

Open

cansjt mentioned this issue Mar 28, 2022

Allow deploying latest build again jobteaser/circleci#114

Merged

ecerulm mentioned this issue May 5, 2022

Fix "Reason: Expired: too old resource version: 379140622 (380367990)" #23504

Closed

ecerulm mentioned this issue May 6, 2022

Prevent KubernetesJobWatcher getting stuck on resource too old #23521

Merged

snjypl mentioned this issue May 9, 2022

WIP - Enable kubernetes watch bookmarks. Handle 410 error in kubernetes_executor #23578

Closed

snjypl mentioned this issue May 10, 2022

FIX: Update watch resource_version on BOOKMARK events. kubernetes-client/python#1796

Closed

potiuk closed this as completed in #23521 May 11, 2022

ephraimbuddy mentioned this issue May 21, 2022

Status of testing of Apache Airflow 2.3.1rc1 #23852

Closed

61 tasks

BlaineEXE mentioned this issue Jun 27, 2022

regular rook-ceph-mgr crashes after update to ceph v17.2.0: too old resource version: 383042908 (383043544) rook/rook#10444

Closed

david-elliott-deliveroo mentioned this issue Feb 17, 2023

Prefect Kubernetes Agent doesn't retry 410 Expired k8s API errors from Watch() PrefectHQ/prefect#8573

Closed

4 tasks

KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state #21087

KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state #21087

Comments

cansjt commented Jan 25, 2022

Apache Airflow version

What happened

What you expected to happen

How to reproduce

Operating System

Versions of Apache Airflow Providers

Deployment

Deployment details

Anything else

Are you willing to submit PR?

Code of Conduct

cansjt commented Jan 28, 2022

potiuk commented Feb 6, 2022

jedcunningham commented Feb 10, 2022

potiuk commented Feb 10, 2022

arkadiusz-bach commented Feb 28, 2022

cansjt commented Feb 28, 2022 • edited Loading

gkarg commented Apr 4, 2022 • edited Loading

cansjt commented Apr 5, 2022

gkarg commented Apr 5, 2022 • edited Loading

cansjt commented Apr 5, 2022

potiuk commented Apr 5, 2022 • edited Loading

gkarg commented Apr 5, 2022

potiuk commented Apr 5, 2022 • edited Loading

gkarg commented Apr 5, 2022

ecerulm commented May 5, 2022 • edited Loading

ecerulm commented May 6, 2022 • edited Loading

snjypl commented May 8, 2022

snjypl commented May 8, 2022 • edited Loading

ecerulm commented May 9, 2022 • edited Loading

snjypl commented May 9, 2022 • edited Loading

1.) preventing 410 error from happening:

2.) handling the error once it happens.

snjypl commented May 9, 2022

snjypl commented May 9, 2022 • edited Loading

ecerulm commented May 9, 2022

allow_watch_bookmark=True does not prevent 410

snjypl commented May 9, 2022

snjypl commented May 9, 2022

ecerulm commented May 9, 2022

snjypl commented May 9, 2022 • edited Loading

snjypl commented May 9, 2022

ecerulm commented May 9, 2022 • edited Loading

snjypl commented May 9, 2022 • edited Loading

ecerulm commented May 9, 2022

snjypl commented May 10, 2022

ecerulm commented May 10, 2022

snjypl commented May 10, 2022 • edited Loading

ecerulm commented May 10, 2022

cansjt commented Feb 28, 2022 •

edited

Loading

gkarg commented Apr 4, 2022 •

edited

Loading

gkarg commented Apr 5, 2022 •

edited

Loading

potiuk commented Apr 5, 2022 •

edited

Loading

potiuk commented Apr 5, 2022 •

edited

Loading

ecerulm commented May 5, 2022 •

edited

Loading

ecerulm commented May 6, 2022 •

edited

Loading

snjypl commented May 8, 2022 •

edited

Loading

ecerulm commented May 9, 2022 •

edited

Loading

snjypl commented May 9, 2022 •

edited

Loading

snjypl commented May 9, 2022 •

edited

Loading

snjypl commented May 9, 2022 •

edited

Loading

ecerulm commented May 9, 2022 •

edited

Loading

snjypl commented May 9, 2022 •

edited

Loading

snjypl commented May 10, 2022 •

edited

Loading