[RFC] Airlift: An Airflow Migration Toolkit #25279
Replies: 3 comments 1 reply
- Nice love this
- Really really thoughtful
- Impressive forward-thinking, as usual from Dagster! My little nitpick is that I find the naming of the Peer and Observe stages confusing; it is written at the beginning that Peer means observe too. Maybe it's okay to have longer, but more descriptive, titles.
  It will be interesting to see Airlift able to migrate from Prefect in the future.
Airlift: An Airflow Migration Toolkit
We’re happy to announce an early preview of Airlift (tutorial), a toolkit that accelerates, lowers the cost of, and reduces the risk of migrating from Airflow to Dagster.
It facilitates a three-step, incremental process: peer, observe, and migrate.
We’ve seen early success with the approach, and are looking for further feedback and additional design partners. We are specifically looking for high-conviction users who actively want to migrate, but have been stymied by the absence of a readily available incremental approach and tooling support.
Context
Many teams and engineers look to Dagster as their orchestrator of choice. Frequently these teams have an existing orchestrator (or orchestrators). And often that is Airflow.
System evolution and migration is difficult work. The essential quality of a successful migration is an incremental process. An incremental process implies a number of critical qualities that make it preferable to a “big bang” migration in nearly all cases:
Airlift is designed with the above in mind.
Introducing Airlift
We propose a three-stage, tool-assisted process for migrating an Airflow instance to Dagster:
1. Peer: connect Dagster to the Airflow instance and represent its DAGs as assets, without modifying Airflow.
2. Observe: model the asset graph behind each DAG and map those assets to Airflow tasks.
3. Migrate: proxy task execution to Dagster on a per-task basis, moving business logic over incrementally.
We’ll go through each of these stages in greater detail and describe the benefits.
Step 1: Peer
The first stage is to “peer” the Airflow instance to Dagster. This requires just a few lines of Dagster code.
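For illustration, a minimal sketch of that peering code might look like the following, based on the Airlift tutorial; the instance name, webserver URL, and credentials are placeholders:

```python
# Peering step: register an Airflow instance with Dagster so its DAGs appear as assets.
from dagster_airlift.core import (
    AirflowBasicAuthBackend,
    AirflowInstance,
    build_defs_from_airflow_instance,
)

airflow_instance = AirflowInstance(
    name="my_airflow_instance",  # placeholder name for this Airflow deployment
    auth_backend=AirflowBasicAuthBackend(
        webserver_url="http://localhost:8080",  # placeholder URL and credentials
        username="admin",
        password="admin",
    ),
)

# Builds Dagster asset representations of every DAG in the Airflow instance,
# plus the machinery that polls Airflow's REST API for completed runs.
defs = build_defs_from_airflow_instance(airflow_instance=airflow_instance)
```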
Once you’ve added this code to your Dagster deployment, all of your Airflow DAGs will show up automatically as Dagster assets. Any successful run of an Airflow DAG will register as a materialization.
This effortlessly starts the process of using Dagster as your single control and observability plane. With the DAGs ingested into the asset graph you can, for example:
We’ve found that this step is important psychologically for all stakeholders, as they can see, with minimal effort, how Dagster will be used as a universal observability plane during the process.
It also demonstrates the parallelization available throughout this process. Teams interested in building downstream greenfield Dagster assets can do so immediately, completely separate from the “observe” and “migrate” processes happening upstream. This is the power of incrementalism.
Step 2: Observe
The next step of the process is to incrementally add asset lineage to the Dagster instance for the assets orchestrated within Airflow.
Inside traditional Airflow DAGs there exists an implicit, unencoded asset graph that corresponds to physical assets in durable storage. The “observe” step can be thought of as making this implicit graph explicit.
For example, this simple DAG contains an implicit five-node asset graph, orchestrated by its three tasks:
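The original example isn’t reproduced here, so the following is a hypothetical sketch of such a DAG; the DAG id, task ids, and commands are illustrative. Its three tasks implicitly produce five assets: raw_customers, then stg_customers and customers via dbt, then customers_csv and customers_parquet via export.

```python
# Hypothetical Airflow 2.x DAG with three tasks and an implicit five-asset graph.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("rebuild_customers_list", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
    load_raw_customers = BashOperator(
        task_id="load_raw_customers", bash_command="python load_customers.py"
    )
    build_dbt_models = BashOperator(task_id="build_dbt_models", bash_command="dbt build")
    export_customers = BashOperator(
        task_id="export_customers", bash_command="python export_customers.py"
    )

    load_raw_customers >> build_dbt_models >> export_customers
```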
We model this asset graph in Dagster and map the assets to tasks in the Airflow DAG using Airlift’s `assets_with_task_mappings` function:
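(A sketch continuing the hypothetical DAG above; asset keys and task ids are illustrative, and the `build_defs_from_airflow_instance` usage follows the Airlift tutorial.)

```python
# Observe step: declare the assets behind the DAG and map them to its tasks.
from dagster import AssetSpec, Definitions
from dagster_airlift.core import assets_with_task_mappings, build_defs_from_airflow_instance

mapped_assets = assets_with_task_mappings(
    dag_id="rebuild_customers_list",
    task_mappings={
        "load_raw_customers": [AssetSpec("raw_customers")],
        "build_dbt_models": [
            AssetSpec("stg_customers", deps=["raw_customers"]),
            AssetSpec("customers", deps=["stg_customers"]),
        ],
        "export_customers": [
            AssetSpec("customers_csv", deps=["customers"]),
            AssetSpec("customers_parquet", deps=["customers"]),
        ],
    },
)

defs = build_defs_from_airflow_instance(
    airflow_instance=airflow_instance,  # as defined in the peering step
    defs=Definitions(assets=mapped_assets),
)
```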
The Dagster UI will now show us the asset relationships within this task, and how they relate to the Airflow DAG. Using the task-mapping information, Dagster will mark assets as “materialized” when the corresponding task in Airflow successfully completes.
Orchestrated in Airflow:
Observed in Dagster:
Performing this step before migration allows you to integrate Airflow-orchestrated assets into your global asset graph. You can trace dependencies from other assets that Dagster orchestrates, dependencies within an Airflow instance (so-called “cross-DAG dependencies”), or even dependencies between multiple Airflow instances. Most functionality, such as alerting, asset checks, and cataloging features, will work out of the box for these assets.
Step 3: Migrate
The next stage of Airlift is to actually migrate compute. Up until now, we haven’t touched the Airflow instance at all (beyond ensuring that we can hit its REST API). Now we start that process.
The first step is to change your Airflow instance to enable proxying to Dagster. Proxying has a special meaning in Airlift: swapping out tasks at runtime to invoke the Dagster API. You add a call to `proxying_to_dagster` at the bottom of any file from which the Airflow scheduler imports DAGs:
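(A sketch based on the Airlift tutorial; the exact import paths and the `proxied_state` directory layout are assumptions.)

```python
# At the bottom of the DAG file: hand task execution over to Dagster per the proxied state config.
from pathlib import Path

from dagster_airlift.in_airflow import proxying_to_dagster
from dagster_airlift.in_airflow.proxied_state import load_proxied_state_from_yaml

proxying_to_dagster(
    global_vars=globals(),
    proxied_state=load_proxied_state_from_yaml(Path(__file__).parent / "proxied_state"),
)
```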
This function post-processes the Airflow DAG and swaps in a task that invokes Dagster programmatically in place of the original task. It does this on a per-task basis, based on configuration in your Airflow repo.
On the Dagster side, there must be code that executes business logic equivalent to that of the Airflow operator. This typically means there is a factory function that creates Dagster definitions per operator type in your Airflow installation.
For example, imagine you used this pattern for running dbt projects:
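The original snippet isn’t shown here, so the following is a hypothetical example of such a pattern; the factory name, task ids, and project directory are illustrative.

```python
# Hypothetical Airflow-side pattern: a small factory that wraps the dbt CLI in a BashOperator.
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/opt/dbt/jaffle_shop"  # placeholder project location

def make_dbt_task(task_id: str, dbt_command: str) -> BashOperator:
    """Create a task that shells out to dbt for the given command (e.g. 'build')."""
    return BashOperator(
        task_id=task_id,
        bash_command=f"dbt {dbt_command} --project-dir {DBT_PROJECT_DIR}",
    )
```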
For each of those, you would write the equivalent Dagster code.
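A sketch of what the Dagster-side equivalent might look like, using the dagster-dbt integration; the manifest path is a placeholder.

```python
# Dagster-side equivalent of the dbt BashOperator pattern above.
from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

@dbt_assets(manifest="/opt/dbt/jaffle_shop/target/manifest.json")  # placeholder manifest path
def jaffle_shop_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Stream dbt events back to Dagster so each model materializes as an asset.
    yield from dbt.cli(["build"], context=context).stream()
```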
The effort required to migrate here scales mostly with the number of operators rather than the number of tasks, as per-task migration is mechanical and goes quickly. We also provide out-of-the-box support for many common operators, such as the KubernetesPodOperator, the PythonOperator, and the BashOperator when used to invoke dbt.
Once you have a corresponding set of assets for your task, you can “proxy” execution to Dagster, meaning that instead of executing the business logic of your asset in Airflow, it will execute in Dagster via an API call. This is managed by per-DAG YAML files in your Airflow repository. To proxy a task, you construct a simple YAML file:
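(A sketch of what such a file might contain, reusing the task ids from the earlier hypothetical DAG; the exact schema follows the Airlift tutorial.)

```yaml
# proxied_state/rebuild_customers_list.yaml: per-task switches for proxying execution to Dagster.
tasks:
  - id: load_raw_customers
    proxied: True
  - id: build_dbt_models
    proxied: True
  - id: export_customers
    proxied: False   # still runs in Airflow until you flip it
```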
And now this task will instead materialize a set of Dagster assets. At runtime, the task is replaced with one that invokes Dagster over an API call.
This configuration-based approach for per-task proxying confers a number of advantages:
Now, let’s say something went wrong with the migration - your downstream check fails or the asset fails to materialize. With a one-line change to your migration file, you can switch execution back to your original Airflow business logic:
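For example, flipping a single value in the same hypothetical YAML file rolls that task back:

```yaml
tasks:
  - id: load_raw_customers
    proxied: True
  - id: build_dbt_models
    proxied: False   # flipped back: this task executes its original Airflow logic again
  - id: export_customers
    proxied: False
```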
Step 4: Complete the migration
Once an Airflow DAG is fully migrated, you can just add the equivalent schedule to Dagster and then delete the `assets_with_task_mappings` call and all associated code.
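A sketch of what that replacement schedule might look like; the job name, asset selection, and cron string are illustrative and should mirror the original DAG’s cadence.

```python
# Replace the Airflow DAG's cron schedule with a Dagster schedule targeting the migrated assets.
from dagster import ScheduleDefinition, define_asset_job

customers_job = define_asset_job(
    "rebuild_customers_list_job",
    selection=["raw_customers", "stg_customers", "customers", "customers_csv", "customers_parquet"],
)

rebuild_customers_schedule = ScheduleDefinition(
    job=customers_job,
    cron_schedule="0 0 * * *",  # same cadence as the original Airflow DAG
)
```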
What about existing approaches?
There already exists a package called `dagster-airflow`, which enables Dagster to directly import and execute unmodified Airflow DAGs. This package is best described as an adapter, and it is not a complete solution to the problem. In the end, dagster-airflow was best suited for the scenario in which a data platform team had a “long tail” of simple DAGs not under active development. It allowed that team to migrate those DAGs en masse and then decommission their Airflow infrastructure to reduce maintenance overhead. Something more flexible and incremental, like Airlift, is required for the general use case.
Future Roadmap
Our initial target for this toolkit is high-conviction users who seek to fully migrate from Airflow to Dagster. However, this is only the first step.
For customers at larger scale or older organizations, there will always be a heterogeneous scheduling landscape for multiple reasons:
We view Airlift as the first step towards a generalized orchestration federation solution. You can already see hints of that in this discussion: we mention that you can visualize multiple Airflow instances in a single Dagster deployment, and that the “observe” step on its own provides incremental value. This can generalize across other orchestrators and contexts.
With this capability, Dagster will serve as an asset-oriented control plane that, in addition to providing a best-in-class orchestrator for greenfield data pipelines, can also incorporate and unify existing orchestrators in a single tool.
To that end, we will improve observability capabilities and expand beyond Airflow to incorporate other orchestrators and schedulers, such as AWS Step Functions, Azure Data Factory, and dbt Cloud. Beyond that, we plan to deliver the framework and patterns these are built on, so that users and the ecosystem can build integrations for other orchestrators and schedulers.
Next Steps
Thanks for reading! In terms of next steps:
There is a `#airflow-migration` channel in Slack, so please follow along there.