-
Notifications
You must be signed in to change notification settings - Fork 14.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support glob syntax in .airflowignore files (#21392) #22051
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
|
airflow/utils/file.py
Outdated
elif '**' in pattern and '/**/' not in pattern: | ||
reason = "'**' must be between '/' chars when not at beginning or end of pattern" | ||
if reason: | ||
log.warning(f"Ignoring glob '{pattern}' from {ignore_file_path}: {reason}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.warning(f"Ignoring glob '{pattern}' from {ignore_file_path}: {reason}") | |
log.warning(f"Ignoring glob '%s' from %s: %s:', pattern, ignore_file_path, reason) |
Please avoid stirring formatting before passing to logger.
Thanks for your review @mik-laj! I will make the changes to the PR based on your comments. I've also found some logic bugs in the dependency being used for the gitignore parsing, so currently looking into workarounds and/or alternatives. |
9f021ad
to
cf23b3a
Compare
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease. |
Seme doc building errors and tests failing @ianbuss :) |
A new configuration parameter "CORE_IGNORE_FILE_SYNTAX" is added to allow patterns in .airflowignore files to be interpreted as either regular expressions (the default) or glob expressions as found in .gitignore files. This allows users to use patterns they will be familiar with from tools such as git, helm and docker. Glob expressions support wildcard matches ("*", "?") within a directory as well as character classes ("[0-9]"). In addition, zero or more directories can be matched using "**". Patterns can be negated by prefixing a "!" at the beginning of the pattern. The "fnmatch" library in core Python does not produce patterns that are fully compliant with the kind of patterns that users will be used to from gitignore or dockerignore files, so the globs are parsed using the pathspec package from PyPI. To aid with debugging ignorefile patterns a more helpful error message is emitted in the logs for invalid patterns, which are now skipped rather than causing a hard-to-read scheduler stack trace.
Thanks @potiuk looking into the failing tests now! |
cf23b3a
to
9e7b4b8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nitpicks
airflow/models/dagbag.py
Outdated
Note that if a ``.airflowignore`` file is found while processing | ||
Note that if an ``.airflowignore`` file is found while processing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me a is right because .airflowignore
is prodounced dot airflow ignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yeah, I can go long with that :) Will update.
The pattern syntax used in the ".airflowignore" files in the DAG directories. Valid values are | ||
``regexp`` or ``glob``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern syntax used in the ".airflowignore" files in the DAG directories. Valid values are | |
``regexp`` or ``glob``. | |
The pattern syntax used in the ".airflowignore" files in the DAG directories. Valid values are | |
``regexp`` and ``glob``. |
I think this is correct English? (I’m non-native so don’t take my words)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think either can work actually in this context, I chose "or" since you have to choose one or the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either makes sense for me too
Thanks for your additional review @uranusjr have made changes to the docs. |
Awesome work, congrats on your first merged pull request! |
Cool it made it to 2.3.0 :) |
Congrats on your first commit @ianbuss 🎉 |
A new configuration parameter "CORE_IGNOREFILE_SYNTAX" is added to
allow patterns in .airflowignore files to be interpreted as either
regular expressions (the default) or glob expressions as found in
.gitignore files. This allows users to use patterns they will be
familiar with from tools such as git, helm and docker.
Glob expressions support wildcard matches ("*", "?") within a directory
as well as character classes ("[0-9]"). In addition, zero or more
directories can be matched using "**". Patterns can be negated by
prefixing a "!" at the beginning of the pattern.
The "fnmatch" library in core Python does not produce patterns that are
fully compliant with the kind of patterns that users will be used to
from gitignore or dockerignore files, so the globs are parsed using
the gitignore-parser package from PyPI.
To aid with debugging ignorefile patterns a more helpful error
message is emitted in the logs for invalid patterns, which are
now skipped rather than causing a hard-to-read scheduler stack trace.
closes: #21392
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.