-
Notifications
You must be signed in to change notification settings - Fork 14.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AIRFLOW-4364] Add Pylint to CI #5238
Conversation
Hey @BasPH - I will take a closer look later today as I am very interested :) (travelling today) but I have one question/idea which we might implement with this change. We can defer it for the future as well, but I think maybe it's worth starting now. Since we are starting to use more linters, I thought maybe we could already switch to using pre-commit-hook framework for those linters. It allows to run checks on CI but (what is more important) it can run the very same checks as pre-commit-hooks. It is really nicely implemented - has nice UI, allows to add many ready-to-use linters and checkers (and some automated code modification like adding licence headers) and it is super-easy to install locally by the developer. And it has pluginable interface where it can already (I believe) filters only changed files (not lines by default though). As local pre-commit check, It could be run as pre-commit for all locally modified files, so that people are encouraged to fix error faster. And on Travis we could continue checking only modified lines for example. I think you could fairly easily turn your python script into a pre-commit plugin rather than have a standalone script and then we could benefit from being able to run the checks with pre-commit hooks (which is far better than waiting for Travis). I discovered it recently and applied successfully to the Ooozie2Airflow converter we work on - we applied some 20+ checks. You can see for example here: And here is the list of checks we have implemented in our project:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a general comment. Maybe we should really not care about changed lines? I think pylint will be introduced way faster if we limit it per file rather than per changed line. It's a bit "harsh" on the individual contributors on one hand, but it will be much better for the community.
cell-var-from-loop, # Raises spurious errors | ||
super-init-not-called, # BasPH: ignored for now but should be fixed somewhere in the future | ||
arguments-differ, # Doesn't always raise valid messages | ||
import-error, # Requires installing Airflow environment in CI task which takes long, therefore ignored. Tests should fail anyways if deps are missing. Possibly un-ignore in the future if we ever use pre-built Docker images for CI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will hopefully soon - as most of the stability problems with tests are fixed now, I am resuming working on this :)
scripts/ci/ci_pylint.py
Outdated
@@ -0,0 +1,104 @@ | |||
# Licensed to the Apache Software Foundation (ASF) under one or more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add shebang here?
#!/usr/bin/env python3
And make the script executable. That would be helpful to run in standalone and indicate python3 compatibility
scripts/ci/ci_pylint.py
Outdated
|
||
# Get Python files to run Pylint against. | ||
# Git command from https://github.com/sk-/git-lint/blob/master/gitlint/git.py | ||
git_modified_files_cmd = "git status --porcelain --untracked-files=all --ignore-submodules=all".split() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand think this command will return no files in CI, but the purpose is to run tests on all locally modified files only. Maybe we should mention somewhere in the readme that the script might be run locally, not only in CI environment then (and how).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I borrowed this method from git-lint. The shell script calls git --reset
on the repo to make everything locally modified, and this Python script can then pick it up. This script can be used both locally and in Travis, by starting the ci_pylint.sh script.
I will add documentation and more comments.
scripts/ci/ci_pylint.py
Outdated
for filename in py_files_to_check: | ||
git_blame_cmd = "git blame --porcelain {0}".format(filename).split() | ||
try: | ||
with open(os.devnull, "w") as devnull: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we are using python 3.5+ maybe this will be better:
https://docs.python.org/3/library/contextlib.html#contextlib.redirect_stderr
scripts/ci/ci_pylint.py
Outdated
filename_lines[filename] = None | ||
continue | ||
|
||
# We only check lines starting with 40 zeros. Therefore run this script only from ci_pylint.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to figure out that 40 zeros means "not-yet-committed". Maybe we should mention that this is the intention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Will do.
scripts/ci/ci_pylint.sh
Outdated
git cat-file -e ${FIRST_COMMIT} || git reset --soft $TRAVIS_COMMIT~1 | ||
|
||
python $(dirname ${BASH_SOURCE[0]})/ci_pylint.py | ||
else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we reset all back to the original state here? I think it is nice for any other potential scripts running after this one.
scripts/ci/ci_pylint.sh
Outdated
FIRST_COMMIT=${TRAVIS_COMMIT_RANGE:0:12} | ||
|
||
# First commit in range exists. This should pick up regular commits. | ||
git cat-file -e ${FIRST_COMMIT} && git reset --soft ${TRAVIS_COMMIT_RANGE%...*} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a question - I could not find it quickly but maybe we should get somehow the base branch for the PR and make the check for all changes? From what I see now, we will only run pylint check for the last pushed commit if we add a new one in the existing PR, but maybe we should run it for all the commits in PR ? It should not change much but maybe it can be simplified a bit in this case (no distinction of normal/force-push commits).
@potiuk The script was a bit tricky to get right. Will comment on all individual comments after this. About the path to complete Pylint compatibility: I'd love to have Airflow completely Pylint compatible asap. I think checking only changed lines is the fastest way to have some form of Pylint in Airflow, and then we (the community) can work on making all code compatible in the meantime. Since checking only changed lines is a bit messy and requires these extra scripts, another option is to first make everything compatible, touch 99% of the codebase in a single PR (or split up PRs per module/package) and then enable Pylint in the CI afterwards. What do you think? |
I agree the per-line script is pretty messy. I'd be all for making a big pylint update (but then a lot of PRs will have merge conflicts). And we will have a lot of problems with merging stuff to v1-10* branches. But maybe it's worth it :). As long as we provide some tools for all the contributors to fix/test their changes locally (like pre-commit hooks), maybe it's the best way to implement it. It's a bit harsh, but maybe we can run this through community and ask them for feedback what they think about it ? I think personally it's the best way of introducing such changes - rather than fix everything, give all the contributors the tools and support in case of questions (like be ready to quickly answer questions - whether we should disable a rule or fix this particular case) and let them convert their PRs in a distributed fashion on their own. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BasPH Thank you for the addition of Pylint to make our code a better place :P
I added a few comments maybe you can take a look and state your opinion. I am aware of that we will have a lot to refactor (even more when accepting my suggestions) to have a fully linted code base, but I would be glad to help out on that :)
.pylintrc
Outdated
#function-rgx= | ||
|
||
# Good variable names which should always be accepted, separated by a comma. | ||
good-names=i, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally don't like these exceptions of good names
. A good name for me is something that well describes its purpose.
The only name I would leave there is the _
because it actually has a meaning - ignore that value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think those values are quite common for a a number of simple loops (i,jk) , 'ex' being universally understood for Exception and '_' of course. I'd remove Run though ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points, for me the use of i
in this example is pretty clear and used by many Python developers:
for i in range(100):
....
So I suggest to leave that in.
I removed Run
, not sure how that's ever a good variable name.
@potiuk Another option might be to blacklist all files, enforce Pylint in the CI (on full files - not changed lines), and one-by-one, or directory-by-directory, make all files compatible and remove from the blacklist, until everything is done. |
@BasPH - I like that much more. Such a blacklist would be our TO-DO list. and we could have a separate JIRA issue for which we could report any commits that are fixing PyLint. We could then make a similar effort as we do with dropping Python 2. I am happy to help with that :) |
@potiuk Cool, that would work + no more hacky script for figuring out changed lines :) What do you think of the following: I have this command to run Pylint: find . -name "*.py" -not -path "./.eggs/*" -not -path "./airflow/www/node_modules/*" -not -path "./airflow/_vendor/*" | \
grep -v 'airflow/[a-zA-Z_]*.py' | \
grep -v 'tests/[a-zA-Z_]*.py' | \
grep -v airflow/api | \
grep -v airflow/bin | \
grep -v airflow/config_templates | \
grep -v airflow/contrib/auth | \
grep -v airflow/contrib/hooks | \
grep -v airflow/contrib/operators | \
grep -v airflow/contrib/plugins | \
grep -v airflow/contrib/sensors | \
grep -v airflow/contrib/task_runner | \
grep -v airflow/contrib/utils | \
grep -v airflow/dag | \
grep -v airflow/example_dags | \
grep -v airflow/executors | \
grep -v airflow/hooks | \
grep -v airflow/jobs | \
grep -v airflow/kubernetes | \
grep -v airflow/lineage | \
grep -v airflow/macros | \
grep -v airflow/migrations | \
grep -v airflow/models | \
grep -v airflow/operators | \
grep -v airflow/security | \
grep -v airflow/sensors | \
grep -v airflow/task | \
grep -v airflow/ti_deps | \
grep -v airflow/utils | \
grep -v airflow/www | \
grep -v dags | \
grep -v docs | \
grep -v scripts | \
grep -v setup.py | \
grep -v tests/api | \
grep -v tests/cli | \
grep -v tests/contrib/hooks | \
grep -v tests/contrib/operators | \
grep -v tests/contrib/sensors | \
grep -v tests/contrib/task_runner | \
grep -v tests/contrib/utils | \
grep -v tests/executors | \
grep -v tests/hooks | \
grep -v tests/jobs | \
grep -v tests/kubernetes | \
grep -v tests/lineage | \
grep -v tests/macros | \
grep -v tests/migrations | \
grep -v tests/minikube | \
grep -v tests/models | \
grep -v tests/operators | \
grep -v tests/plugins | \
grep -v tests/security | \
grep -v tests/sensors | \
grep -v tests/task | \
grep -v tests/test_utils | \
grep -v tests/ti_deps | \
grep -v tests/utils | \
grep -v tests/www | \
xargs pylint --output-format=colorized For every
If you agree, I will change this PR by removing the changed-line-checking scripting and running this command in the CI pipeline. |
@potiuk I've made a list (scripts/ci/pylint_todo.txt) of all the files to exclude from linting. All is green on my machine + own Travis. If you accept and merge once Travis is done, we can create issues from the list above and have everybody make Airflow Pylint compatible 🙂 |
All Good @BasPH . I will be travelling to the Bay Area Meetup tomorrow and pretty busy with meetings in Bay Area in next week so I won't be able to do a lot over the next week or so, but I will do my best to help with it. |
I wish this had been introduced as a non-failing check at least at the beginning. Pylint is a notoriously prickly linter... |
Hi @pgagnon could you clarify? All files are blacklisted so it should succeed right now. Only (obviously) when you remove files from the blacklist, will you see failures. But that's also the goal, to make everything Pylint-compatible and eventually have much more consistent and readable code. |
Yeah. Also I it should be possible to disable some rules in parts of the files in case we find it non-compliant and there is a good reason for it. It's one of the valid ways of dealing with such issues. Also that made me think @BasPH -> I thought maybe we should also have a very short explanation in CONTRIBUTING.md on the Pylint process we are going to follow - once we create JIRA issues. We could link to the JIRA issue and some links to pylint docs explaining how to deal with checks in case there is no easy fix (mainly #pylint disable/enable comments), because not everyone realises that. |
@BasPH I didn't realize at first that a blacklist mechanism was in place. Nevertheless I feel that we should disable |
This PR adds Pylint to the CI, initially all files are blacklisted and the idea is to remove from the blacklist and make pylint compatible one-by-one. The default Pylint configuration is quite strict so I ran Pylint with default settings, collected all messages and went over every single message type and decided if it should be kept, changed or dropped.
The list of all messages with default Pylint configuration on Airflow master:
- Variables (1643): See all different styles, everything seems valid to me, kept at snake_case.
- Functions (35): Not too many errors, kept at snake_case.
- Arguments (146): Kept snake_case but allowed for 1 char argument names
- Methods (36): Kept on snake_case.
- Classes (16): Kept on PascalCase.
Make sure you have checked all steps below.
Jira
Description
Check Pylint on changed lines.
Tests
Commits
Documentation
Code Quality
flake8