
Clearing tasks for previously finished DAG runs in airflow 2.0 does not lead to scheduling of tasks (when max_active_runs is reached) #13407

Closed
soltanianalytics opened this issue Dec 31, 2020 · 11 comments
Labels
affected_version:2.0 · area:Scheduler · kind:bug · pending-response · priority:low
Milestone

Comments

@soltanianalytics
Contributor

Apache Airflow version: 2.0, LocalExecutor

Environment: Docker on Windows 10 with WSL using image apache/airflow:2.0.0-python3.8

What happened:

Situation:

  • There is a DAG, say mydag, with
    • catchup=True
    • max_active_runs=1
  • Let's say there are two DAG runs, t=0 and t=1
  • The first task of the DAG is a sensor that senses whether the previous DAG run was successful
  • Now, t=0 gets run, tasks are scheduled, and a task in t=0 fails
  • Then, t=1 gets run, and the first task (the sensor) cannot find a successful previous run, so it keeps sensing
  • Now I clear the failed task in t=0 and expect it to run again, as it did in Airflow 1.x
  • It doesn't; instead, the scheduler logs the following:
scheduler_1  | [2020-12-31 15:25:32,770] {scheduler_job.py:1667} INFO - DAG mydag already has 1 active runs, not queuing any tasks for run 2020-12-26 05:00:00+00:00 [note: this is t=0]

Thus, t=0 never finishes, t=1 never senses a finished run, and any t=n with n>1 also has no chance of ever succeeding.

One alternative would be to remove the max_active_runs constraint, but that is not feasible: it would create hundreds of DAG runs at once, which is a complete performance killer.
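
For context, the DAG in question looks roughly like the sketch below. This is illustrative only: the issue only states that the first task is a sensor on the previous run, so the ExternalTaskSensor pointing at the same DAG, the task names, the dates, and the schedule are assumptions, not my production code.

from datetime import timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.utils import timezone

with DAG(
    dag_id="mydag",
    start_date=timezone.datetime(2020, 12, 20),
    schedule_interval="@daily",
    catchup=True,          # backfill all missed runs
    max_active_runs=1,     # only one run may be active at a time
) as dag:
    # First task: wait until the previous run of this same DAG has finished
    # its last task; execution_delta points one schedule interval back.
    wait_for_previous = ExternalTaskSensor(
        task_id="wait_for_previous_run",
        external_dag_id="mydag",
        external_task_id="do_work",
        execution_delta=timedelta(days=1),
    )

    do_work = DummyOperator(task_id="do_work")

    wait_for_previous >> do_work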

What you expected to happen:

As with previous airflow versions, I would expect that the cleared tasks get scheduled again, which they don't.

Why this happens:

tl;dr: Ultimately, this happens because Airflow queries TaskInstance (TI) instead of DagRun (DR) here: https://github.com/apache/airflow/blob/v2-0-stable/airflow/jobs/scheduler_job.py#L1499-L1509

_do_scheduling() runs _schedule_dag_run() once for each dag_id and passes in the set of active DAG runs as an argument, here: https://github.com/apache/airflow/blob/v2-0-stable/airflow/jobs/scheduler_job.py#L1515. The tasks that should be queued are not queued because their DAG runs are not in that set of active DAG runs, even though they are in fact running. The reason is that https://github.com/apache/airflow/blob/v2-0-stable/airflow/jobs/scheduler_job.py#L1499-L1509 builds the set from the TaskInstances of each DAG run and their execution dates instead of from the DagRuns themselves; since the tasks were successful or failed and then cleared, they are filtered out of the query. If you replace TI with DR in that query, this should work fine and fix this issue without breaking anything that currently works.
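
To make the suggested change concrete, here is a rough sketch of the two queries. This paraphrases the linked scheduler code rather than quoting it; session and dag_ids are placeholders for the corresponding objects inside _do_scheduling().

from airflow.models import DagRun as DR, TaskInstance as TI
from airflow.utils.state import State

# Current logic (roughly): a run counts as "active" if it still has unfinished
# TaskInstances, so a run whose tasks all finished and were then cleared drops
# out of this set even though the DagRun itself is back in the RUNNING state.
active_runs = session.query(TI.dag_id, TI.execution_date).filter(
    TI.dag_id.in_(dag_ids),
    TI.state.notin_(list(State.finished)),
)

# Proposed logic: look at the DagRun table directly, so every run whose state
# is RUNNING is treated as active, regardless of the state of its tasks.
active_runs = session.query(DR.dag_id, DR.execution_date).filter(
    DR.dag_id.in_(dag_ids),
    DR.state == State.RUNNING,
)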

How to reproduce it:

You don't need the sensor logic I described above to reproduce this behavior. While I didn't do this myself, the following should reproduce it:

  • Create a DAG mydag with catchup=True and max_active_runs=1
  • Give it just a dummy task, and let it run a couple of times so you have a couple of successful DAG runs
  • Pause the DAG*
  • Clear a couple of tasks in dag runs that were successful
  • Run this snippet to see the result of the query with TI and DR, respectively:
from airflow import models, settings
from airflow.utils.state import State

TI = models.TaskInstance
DR = models.DagRun
dag_id = "mydag"
session = settings.Session()

# Current scheduler logic: a run counts as active if it has unfinished TaskInstances
result = "\n\nactive DAG runs according to current code logic:"
for data_tuple in session.query(TI.dag_id, TI.execution_date).filter(
    TI.dag_id.in_([dag_id]), TI.state.notin_(list(State.finished))
):
    result += "\n\t" + str(data_tuple)

# Proposed logic: a run counts as active if the DagRun itself is in the RUNNING state
result += "\n\nactive DAG runs according to my proposed code logic:"
for data_tuple in session.query(DR.dag_id, DR.execution_date).filter(
    DR.dag_id.in_([dag_id]), DR.state.in_([State.RUNNING])
):
    result += "\n\t" + str(data_tuple)

print(result, "\n")

*Pausing the DAG only prevents your Airflow instance from working through the DAG runs one by one; you would not need to pause if your DAG has a sensor that senses the success of the previous DAG run, like mine does.

I will be creating a PR with the suggested fix shortly.

@soltanianalytics soltanianalytics added the kind:bug label Dec 31, 2020

@soltanianalytics
Contributor Author

I did some more reading (mainly #1442 and https://issues.apache.org/jira/browse/AIRFLOW-137). I see now that using TI was entirely intentional: the scheduler is currently expected to ignore DagRuns that were re-set to running by clearing tasks, in order to avoid violating max_active_runs, until the number of active runs drops below that limit again. My issue with this is that

  1. The scheduler does not schedule tasks in DagRuns which are, in fact, running
  2. When a user clears tasks, the user wants those tasks to be scheduled; therefore I think exceeding max_active_runs in this situation (as in my use case) is deliberate and a feature, not a bug

But from #1442 I can see that a user might also want to run only specific tasks across a large number of DagRuns, while executing tasks in at most max_active_runs DagRuns at a time. Arguably, when using backfill or catchup=True in general, I would expect to be able to rely on tasks being executed in order of execution_date, because that is also the order in which they run when my Airflow installation is simply running normally. Thus, I think a second alternative is to keep the abovementioned logic but adjust it so that only tasks in the first max_active_runs DagRuns, ordered by execution_date, are run; see the sketch below.
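
A rough sketch of what I mean by this alternative (illustrative only; session, dag_id, and max_active_runs are placeholders, not the actual scheduler variables):

from airflow.models import DagRun as DR
from airflow.utils.state import State

# Of all currently RUNNING runs of this DAG, only the oldest max_active_runs
# (by execution_date) would be allowed to have tasks queued; younger running
# runs would have to wait until the older ones finish.
runs_allowed_to_progress = (
    session.query(DR)
    .filter(DR.dag_id == dag_id, DR.state == State.RUNNING)
    .order_by(DR.execution_date.asc())
    .limit(max_active_runs)
    .all()
)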

I will create a second PR with this alternative approach.

soltanianalytics pushed a commit to soltanianalytics/airflow that referenced this issue Jan 1, 2021
@turbaszek turbaszek added the area:Scheduler label Jan 2, 2021
@turbaszek
Member

CC @ashb

soltanianalytics pushed a commit to soltanianalytics/airflow that referenced this issue Jan 2, 2021
soltanianalytics pushed a commit to soltanianalytics/airflow that referenced this issue Jan 2, 2021
soltanianalytics pushed a commit to soltanianalytics/airflow that referenced this issue Jan 2, 2021
soltanianalytics pushed a commit to soltanianalytics/airflow that referenced this issue Jan 2, 2021
@vikramkoka vikramkoka added the affected_version:2.0 label Jan 16, 2021
@kaxil kaxil changed the title Clearing tasks for previously finished DAG runs in airflow 2.0 does not lead to scheduling of tasks Clearing tasks for previously finished DAG runs in airflow 2.0 does not lead to scheduling of tasks (when max_active_runs is reached) Jan 28, 2021
@kaxil kaxil added the priority:medium and priority:low labels and removed the priority:medium label Jan 28, 2021
@kaxil kaxil modified the milestones: Airflow 2.0.1, Airflow 2.0.2 Feb 4, 2021
kaxil pushed a commit that referenced this issue Mar 5, 2021
ashb pushed a commit that referenced this issue Mar 19, 2021
@ashb ashb removed this from the Airflow 2.0.2 milestone Apr 22, 2021
@ashb ashb added this to the Airflow 2.0.3 milestone Apr 22, 2021
@ashb ashb modified the milestones: Airflow 2.0.3, Airflow 2.1.1 May 7, 2021
@kaxil kaxil modified the milestones: Airflow 2.1.1, Airflow 2.2 Jun 22, 2021
@ephraimbuddy
Contributor

@soltanianalytics please can you test this on 2.1.3? I was not able to reproduce it.

@soltanianalytics
Contributor Author

The behavior has indeed changed since 2.0.0, and while not perfect, I can deal with it for now - closing the ticket

@ephraimbuddy
Contributor

The behavior has indeed changed since 2.0.0, and while not perfect, I can deal with it for now - closing the ticket

Please can you provide more context on "not perfect"? I'm happy to look into this issue if you can explain more, thanks.

@soltanianalytics
Contributor Author

My core use case is re-running DAG runs that are not currently running.

So I have a DAG with max_active_runs=1 and catchup=True. This DAG depends on the previous DagRun being successful. I implement this logic via a sensor that senses the success of the last task of the previous DagRun. If the previous DagRun failed, the current one keeps sensing into the abyss, and might fail after some time too. At that point I might have n DagRuns that I want to re-run, where n can be in the dozens. If I simply clear the tasks of all the runs I want to repeat, they may not execute in order; but because every run except the oldest starts with a sensor that only succeeds once the previous DagRun has succeeded, this only works properly if the DagRuns are executed in chronological order.

I can make that happen if I let the currently active DagRun fail before clearing tasks and setting the DagRun states to running. Most of the time, that should do the trick. If not, I'll just delete all relevant DagRuns and they'll be re-created in chronological order.
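
For completeness, a rough sketch of the clearing step of that workaround, done after the currently active run has failed. The same can be done through the UI; the DAG id and dates below are placeholders, and in Airflow 2 clearing the tasks also sets the corresponding DagRuns back to the running state.

from airflow import models
from airflow.utils import timezone

# Load the DAG and clear all task instances in the affected window;
# the cleared DagRuns are re-set to running as part of the clear.
dag = models.DagBag().get_dag("mydag")
dag.clear(
    start_date=timezone.datetime(2020, 12, 20),
    end_date=timezone.datetime(2020, 12, 26),
)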

@ephraimbuddy
Contributor

Thanks @soltanianalytics

@soltanianalytics
Contributor Author

If I don't let the current one fail first, or if Airflow otherwise has a hiccup, it will simply never schedule the "correct" DagRuns because of max_active_runs. (My understanding of max_active_runs is that it should only apply when creating running DagRuns, and that once DagRuns are running, Airflow should schedule their tasks irrespective of max_active_runs; however, others seem to disagree with this interpretation.)

@ephraimbuddy
Contributor

Actually, the problem with max_active_runs is that the line of code below:

dag_runs = self._get_next_dagruns_to_examine(State.QUEUED, session)

doesn't get distinct dag_ids. For example:

dag1 has 8 queued DagRuns, dag2 has 5, and dag3 has 6.

The code would get all 8 DagRuns from dag1 (if it has the closest dates), then 2 DagRuns from dag2, and nothing from dag3.
The correct behavior would be, for each DAG, to get its queued DagRuns and set them to running as long as max_active_runs is not reached.
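
To illustrate the behavior being described, here is a sketch only, not the scheduler's actual code; max_active_runs is hard-coded as a placeholder, whereas the real scheduler reads it per DAG.

from collections import defaultdict

from airflow import models, settings
from airflow.utils.state import State

DR = models.DagRun
session = settings.Session()
max_active_runs = 1  # placeholder; the scheduler reads this per DAG

# Group queued runs per DAG instead of taking them globally by date,
# so one DAG with many queued runs cannot starve the others.
queued_by_dag = defaultdict(list)
for run in (
    session.query(DR)
    .filter(DR.state == State.QUEUED)
    .order_by(DR.execution_date.asc())
):
    queued_by_dag[run.dag_id].append(run)

for dag_id, runs in queued_by_dag.items():
    running = (
        session.query(DR)
        .filter(DR.dag_id == dag_id, DR.state == State.RUNNING)
        .count()
    )
    # Promote queued runs only while the DAG stays under its limit.
    for run in runs[: max(0, max_active_runs - running)]:
        run.state = State.RUNNING

session.commit()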

@soltanianalytics
Contributor Author

soltanianalytics commented Aug 30, 2021

Note that in the issue I described above, all DagRuns are already running, but their tasks are not scheduled.
