Clearing tasks for previously finished DAG runs in Airflow 2.0 does not lead to scheduling of tasks (when max_active_runs is reached) #13407
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
I did some more reading (mainly #1442 and https://issues.apache.org/jira/browse/AIRFLOW-137). I see now that using TI was entirely on purpose. Currently, it is expected to ignore the
But, from #1442 I can see that a user might also want to just have specific tasks run, but have them run across a large number of dag runs. I will create a second PR with this alternative approach.
CC @ashb
@soltanianalytics, please can you test this on 2.1.3? I was not able to reproduce it.
The behavior has indeed changed since
Please can you provide more context on "not perfect"? I'm happy to look into this issue if you can explain more, thanks.
My core use case is re-running DAGs that are not currently running. So I have a DAG with max_active_runs=1. I can make that happen if I let the currently active run fail first.
Thanks @soltanianalytics
If I don't let the current one fail first, or if Airflow otherwise has a hiccup, then it will simply never schedule the "correct" run.
Actually, the problem is that the query at airflow/jobs/scheduler_job.py line 973 (commit 24aa34b) doesn't get distinct dag_ids. For example: dag1 has 8 queued dagruns, dag2 has 5 queued dagruns, and dag3 has 6. The code would get all 8 dagruns from dag1 (if it has the closest date), then 2 dagruns from dag2, and nothing from dag3.
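As a toy illustration of that starvation (this is not the actual scheduler code, and the limit of 10 is made up), a flat ordering by date with a single limit over all queued runs leaves whichever DAGs sort last with nothing:

```python
from collections import Counter

# Toy model: pick queued dag runs ordered by date with one flat limit.
# dag1's runs have the oldest dates, dag3's the newest.
queued = (
    [("dag1", day) for day in range(1, 9)]       # 8 queued runs
    + [("dag2", day) for day in range(9, 14)]    # 5 queued runs
    + [("dag3", day) for day in range(14, 20)]   # 6 queued runs
)
limit = 10  # hypothetical cap per scheduler loop

picked = sorted(queued, key=lambda run: run[1])[:limit]
print(Counter(dag_id for dag_id, _ in picked))
# Counter({'dag1': 8, 'dag2': 2}) -- dag3 gets nothing
```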
Note that in the issue I described above, all
Apache Airflow version: 2.0, LocalExecutor
Environment: Docker on Windows 10 with WSL, using image apache/airflow:2.0.0-python3.8
What happened:
Situation: I have a DAG mydag, with catchup=True and max_active_runs=1. Each run contains a sensor that waits for the previous run of the same DAG to succeed (a rough sketch of such a DAG follows below). After clearing the tasks of a previously finished run (t=0), the cleared tasks are never scheduled again. Thus, t=0 never finishes, t=1 never senses the finished run, and any t=n with n>1 also has no chance of ever succeeding.
One alternative would be to remove the max_active_runs constraint, but that is not feasible, as it would create hundreds of DAG runs at once, which is a complete performance killer.
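The issue does not include the actual DAG, so the following is only a hypothetical reconstruction of a DAG with this shape; the sensor choice (ExternalTaskSensor pointed at the same DAG's previous schedule), names, dates, and schedule interval are all assumptions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.sensors.external_task import ExternalTaskSensor

# Hypothetical reconstruction of the reported setup: each run first waits
# for the previous run of the same DAG to succeed. (The very first run
# would need special handling, omitted here.)
with DAG(
    dag_id="mydag",
    start_date=datetime(2020, 12, 1),
    schedule_interval="@daily",
    catchup=True,
    max_active_runs=1,
) as dag:
    wait_for_previous_run = ExternalTaskSensor(
        task_id="wait_for_previous_run",
        external_dag_id="mydag",
        external_task_id="do_work",
        execution_delta=timedelta(days=1),  # look at the previous schedule
    )
    do_work = DummyOperator(task_id="do_work")
    wait_for_previous_run >> do_work
```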
What you expected to happen:
As with previous Airflow versions, I would expect that the cleared tasks get scheduled again, which they don't.
Why this happens:
tl;dr Ultimately, this happens because Airflow uses TI instead of DR here: https://github.com/apache/airflow/blob/v2-0-stable/airflow/jobs/scheduler_job.py#L1499-L1509

_do_scheduling() runs _schedule_dag_run() once for each dag_id and gives the set of active dag runs as an argument, here: https://github.com/apache/airflow/blob/v2-0-stable/airflow/jobs/scheduler_job.py#L1515. The tasks that should be queued are not queued because their dag runs are not in the abovementioned set of active dag runs, even though those dag runs are in the running state. This is because https://github.com/apache/airflow/blob/v2-0-stable/airflow/jobs/scheduler_job.py#L1499-L1509 looks at all TaskInstances of the dag run and their execution dates instead of looking at the DagRuns, and since the tasks were successful or failed and then cleared, they are filtered out of the query. If you replace TI with DR in that query (sketched below), this should work fine without breaking anything that currently works, and it fixes this issue.
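A rough sketch of the kind of change suggested here (not a verbatim copy of scheduler_job.py; the filters are simplified for illustration): derive the set of active runs from DagRun rows instead of TaskInstance rows, so that runs whose tasks were cleared after finishing still count as active.

```python
from airflow.models import DagRun, TaskInstance as TI
from airflow.utils.session import create_session
from airflow.utils.state import State

with create_session() as session:
    # Roughly the current behaviour: derive active (dag_id, execution_date)
    # pairs from unfinished task instances. Runs whose tasks finished and
    # were then cleared can drop out of this set.
    active_from_tis = (
        session.query(TI.dag_id, TI.execution_date)
        .filter(TI.state.notin_([State.SUCCESS, State.FAILED, State.SKIPPED]))
        .group_by(TI.dag_id, TI.execution_date)
        .all()
    )

    # The suggested alternative: derive the same pairs from running DagRuns.
    active_from_drs = (
        session.query(DagRun.dag_id, DagRun.execution_date)
        .filter(DagRun.state == State.RUNNING)
        .all()
    )
```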
How to reproduce it:
You don't need the sensor logic I described above to reproduce this behavior. While I didn't do this myself, the following should reproduce the behavior (see the sketch after these steps):
- Create mydag with catchup=True and max_active_runs=1, let the first dag run finish, then pause the DAG*.
- Clear the tasks of the finished dag run and unpause the DAG: the cleared tasks never get scheduled again.
- The broken and the fixed behavior correspond to the query linked above using TI and DR, respectively.

*Pausing the DAG only prevents your Airflow instance from working through the dag runs one by one; you would not need to pause if your DAG has a sensor that senses the success of the previous DAG run, like mine do.
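For reference, a minimal sketch of such a reproduction (the DAG name, dates, and the clear command below are illustrative, not taken from the issue):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

# Minimal DAG matching the description above: catchup enabled, only one
# active run allowed at a time.
with DAG(
    dag_id="mydag",
    start_date=datetime(2020, 12, 1),
    schedule_interval="@daily",
    catchup=True,
    max_active_runs=1,
) as dag:
    DummyOperator(task_id="do_work")

# Then, roughly: let the first run finish, pause the DAG, clear that run's
# tasks, e.g.
#   airflow tasks clear mydag -s 2020-12-01 -e 2020-12-01 -y
# and unpause. With max_active_runs already reached, the cleared tasks are
# never queued again.
```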
I will be creating a PR with the suggested fix shortly.