Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.gitignore consistency with Git #526

Closed
9 tasks done
segevfiner opened this issue Jul 4, 2017 · 24 comments
Closed
9 tasks done

.gitignore consistency with Git #526

segevfiner opened this issue Jul 4, 2017 · 24 comments

Comments

@segevfiner
Copy link
Contributor

segevfiner commented Jul 4, 2017

Getting gitignore behavior just right seems to be quite hard. (I think even libgit2 still has some issues with it 😛) This are some things that I encountered which are different in the current implementation in master:

  • A pattern including a trailing slash should match only directories. That slash should not be considered in deciding whether a pattern is a filename glob or a path glob. – PR Correctly handle include, exclude than include in .gitignore #528
  • I think that if you ignore a file, re-include it (A pattern with a '!') and than ignore it again. It should be ignored. (The code currently does return). – PR A pattern with only a trailing slash should be treated as a glob #529
  • A pattern without a trailing slash should match both files and directories. – PR gitignore patterns without a trailing slash should match directories too #530
  • The behavior in regards to ignoring directories is quite finicky. If a directory is ignored, Git doesn't recurse into it, meaning that all the files in it are ignored. Also, "negated" patterns don't affect anything in such directories. There is no such code ATM in Dulwich but this is important to remember.
  • There probably needs to be a way to properly handle .gitignore files from subdirectories. Their patterns should only apply below them and path patterns from parent directories should be anchored to the directory where they are defined. It's unclear how to currently achieve this if at all possible with the current implementation. We probably need to accept the repository relative path to the directory or .gitignore file, and either alter the regex (carefully!) or check the path as an extra check before checking for patterns in the IgnoreFilter (I think that's what the real Git does).
  • Git matches patterns case insensitively on case insensitive file systems. i.e. on Windows.
    Git enables case insensitivity based on the core.ignorecase config.
  • Trailing spaces are ignored unless they are quoted with backslash ("\").
  • Handle .gitignore files with \r\n line endings – PR Handle .gitignore files with \r\n line endings #535
  • We should respect a trailing slash in an absolute path passed to IgnoredFilterManager.is_ignored.

Reference: https://git-scm.com/docs/gitignore

@jelmer
Copy link
Owner

jelmer commented Jul 4, 2017 via email

@segevfiner
Copy link
Contributor Author

segevfiner commented Jul 4, 2017 via email

@segevfiner
Copy link
Contributor Author

A pattern without a trailing slash should match both files and directories:

git init temp && cd temp
mkdir -p foo/bar
cat > .gitignore <<EOF
bar
EOF
git check-ignore -v foo/bar

Output:

.gitignore:1:bar.       foo/bar/

Dulwich:

>>> from dulwich import ignore
>>> with open('.gitignore', 'r') as f:
...     a = ignore.IgnoreFilter(ignore.read_ignore_patterns(f))
... 
>>> a.is_ignored('foo/bar/')
# None

@segevfiner
Copy link
Contributor Author

There probably needs to be a way to properly handle .gitignore files from subdirectories. Their patterns should only apply below them and path patterns from parent directories should be anchored to the directory where they are defined. It's unclear how to currently achieve this if at all possible with the current implementation.

@segevfiner
Copy link
Contributor Author

segevfiner commented Jul 5, 2017

I tried running the following script against PR #530 and discovered that Git ignores case on Windows or other operating systems with case insensitive file systems when matching gitignore patterns. Everything else seems to be ignored correctly, even stuff that libgit2 seems to trip on 😃.

We can probably try this script on other repositories, and once there is support for .gitignore files in subdirectories, trying a similar script that will support them too, on more complicated repositories, should be a good way to validate that we got the implementation right.

The script is a bit slow since it calls check-ignore for every single path in the repo. There probably is a faster way to implement it. (Maybe with git ls-files and git status --ignored).

from __future__ import print_function
import sys
import os
import argparse
import subprocess
from dulwich import ignore


def to_git_path(repo, p, dir_=False):
    git_path = os.path.relpath(p, repo).replace('\\', '/')

    if dir_:
        git_path += '/'

    return git_path


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("repo")

    args = parser.parse_args()

    with open(os.path.join(args.repo, '.gitignore'), 'r') as f:
        ignore_filter = ignore.IgnoreFilter(ignore.read_ignore_patterns(f))
        ignore_filter.append_pattern('.git')

    for root, dirs, files in os.walk(args.repo):
        for name in files:
            git_path = to_git_path(args.repo, os.path.join(root, name))

            dulwich_ignored = bool(ignore_filter.is_ignored(git_path))
            git_ignored = subprocess.call(
                ["git", "-C", args.repo, "check-ignore", "--no-index", git_path],
                stdout=subprocess.PIPE) == 0

            if dulwich_ignored ^ git_ignored:
                print("Diff:", git_path, dulwich_ignored, git_ignored)

        dirs[:] = [i for i in dirs if not ignore_filter.is_ignored(
            to_git_path(args.repo, os.path.join(root, i), True))]

if __name__ == "__main__":
    sys.exit(main())

@segevfiner
Copy link
Contributor Author

Another test that might trip Dulwich (when implemented):

git init temp && cd temp
mkdir a
touch a/b
cat > .gitignore <<EOF
/a/
EOF
cat > a/.gitignore <<EOF
!b
EOF
git check-ignore -v a/b

Output:

.gitignore:1:/a/        a/b

@jelmer
Copy link
Owner

jelmer commented Jul 13, 2017

I believe this last case should now be handled correctly, too.

@segevfiner
Copy link
Contributor Author

@jelmer

  1. IgnoreFilterManager seems to be mishandling the trailing slash for ignoring only directories. The split leaves an empty string at the end of the parts list and it will than test for only the directory name without a trailing slash while iterating. Also, the parent paths that are tested should be tested with a trailing slash too as they are obviously directories.
  2. You split on os.path.sep which changes with the operating system. We probably should split on '/' always. We can also do:
def _posix_path(p):
    """Convert to a path to POSIX style."""
    return p.replace('\\', '/')

And use it in all is_ignored methods. To also make sure that if anyone passes a Windows path, it will also work.

@jelmer
Copy link
Owner

jelmer commented Jul 16, 2017

I've added support for core.ignorecase.

@jelmer
Copy link
Owner

jelmer commented Jul 16, 2017

I've also fixed the issue with directory paths in IgnoreFilterManager. Please let me know if you're aware of any more pending issues.

@segevfiner
Copy link
Contributor Author

segevfiner commented Jul 16, 2017

This two lines: dulwich/ignore.py:298-299, might trash a trailing slash in the passed path when it's absolute.

@segevfiner
Copy link
Contributor Author

segevfiner commented Jul 16, 2017

Here is a version of the script from before using IgnoreFilterManager and purposefully not using the optimization of skipping ignored directories (obviously still quite slow):

from __future__ import print_function
import sys
import os
import argparse
import subprocess
from dulwich.repo import Repo
from dulwich import ignore


def to_git_path(repo, p, dir_=False):
    git_path = os.path.relpath(p, repo).replace('\\', '/')

    if dir_:
        git_path += '/'

    return git_path


def diff(repo, ignore_filter, git_path):
    dulwich_ignored = bool(ignore_filter.is_ignored(git_path))
    git_ignored = subprocess.call(
        ["git", "-C", repo, "check-ignore", "--no-index", git_path],
        stdout=subprocess.PIPE) == 0

    if dulwich_ignored ^ git_ignored:
        print("Diff:", git_path, dulwich_ignored, git_ignored)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("repo")

    args = parser.parse_args()

    with Repo(args.repo) as repo:
        ignore_filter = ignore.IgnoreFilterManager.from_repo(repo)

        for root, dirs, files in os.walk(args.repo):
            for name in files:
                git_path = to_git_path(args.repo, os.path.join(root, name))
                diff(args.repo, ignore_filter, git_path)

            for name in dirs:
                git_path = to_git_path(args.repo, os.path.join(root, name), True)
                diff(args.repo, ignore_filter, git_path)

if __name__ == "__main__":
    sys.exit(main())

Using it on CPython's repository I caught this one:

git init temp && cd temp
mkdir -p a/b/c
cat > .gitignore <<EOF
a/b/*
EOF
git check-ignore -v a/b/

Output:

.gitignore:1:a/b/*      a/b/

Dulwich:

>>> from dulwich.repo import Repo
>>> from dulwich import ignore
>>> a=Repo('.')
>>> b=ignore.IgnoreFilterManager.from_repo(a)
>>> b.is_ignored('a/b/')
# None

EDIT: I think this specific case is not an issue though. There is this specific example in the gitignore man page:

$ cat .gitignore
# exclude everything except directory foo/bar
/*
!/foo
/foo/*
!/foo/bar

Which suggests that /foo/* should only ignore the files within foo and not foo itself allowing !/foo/bar to work. I think it's really git check-ignore being weird with the trailing slash because:

git check-ignore -v a/b

Does show that the directory isn't ignored. I guess I can modify the script to not pass the trailing slash to check-ignore 😖, Git probably does stat to check for a directory rather than testing for a trailing slash in the passed in path.

@segevfiner
Copy link
Contributor Author

@jelmer I edited #526 (comment) a bit. Hopefully you noticed 😛. Just be sure not to accidentally regress this case from the gitignore man page when you change * to match the empty string too:

$ cat .gitignore
# exclude everything except directory foo/bar
/*
!/foo
/foo/*
!/foo/bar

@jelmer
Copy link
Owner

jelmer commented Jul 16, 2017

I've added a specific testcase for the example from the manpage you mentioned. It seems to pass :)

@jelmer
Copy link
Owner

jelmer commented Jul 16, 2017

I've "fixed" the absolute path issue by requiring that callers pass in a relative paths.

@jelmer
Copy link
Owner

jelmer commented Jul 19, 2017

@segevfiner are you aware of any more issues? If not, I'll mark this closed and will do a release in a couple of days :)

@segevfiner
Copy link
Contributor Author

segevfiner commented Jul 19, 2017

I have ran my script (with .rstrip('/') for calling check-ignore) against some repositories I have cloned locally and found this:

git init temp && cd temp
mkdir spam
touch spam/ham
cat > .gitignore <<EOF
/spam/*
EOF
git check-ignore -v spam spam/ham

Output

.gitignore:1:/spam/*    spam/ham.txt
# Only the file inside spam is ignored but not spam itself

Dulwich:

>>> from dulwich.repo import Repo
>>> from dulwich import ignore
>>> a=Repo('.')
>>> b=ignore.IgnoreFilterManager.from_repo(a)
>>> b.is_ignored('spam/')
True
>>> b.is_ignored('spam/ham.txt')
True

It feels like it would be easy to regress something else by accident while fixing this, be careful. It's confusing, and I'm not sure myself exactly how Git handles those kind of patterns so that they don't match the directory itself. The code for this is in dir.c in Git's sources if you want to take a look, but it's quite complicated.

(I have also seen projects do spam/*.* which might also cause trouble)

@jelmer
Copy link
Owner

jelmer commented Jul 19, 2017

morgaine:/tmp/foo% cat .gitignore
/b/*
morgaine:/tmp/foo% git check-ignore -v b
morgaine:/tmp/foo% git check-ignore -v b/
.gitignore:1:/b/* b/

morgaine:/tmp/foo% dulwich check-ignore b
morgaine:/tmp/foo% dulwich check-ignore b/
b/

So git is behaving differently depending on whether a slash is passed in, like Dulwich.

@segevfiner
Copy link
Contributor Author

That's probably just check-ignore, might be a bug in it that it doesn't normalize the passed in path.

Here is an example that shows where there is a difference due to this:

git init temp && cd temp
mkdir a
touch a/b.txt a/c.dat
cat > .gitignore <<EOF
a/*
!a/*.txt
EOF
git check-ignore -v a a/b.txt a/c.dat

Output:

.gitignore:2:!a/*.txt   a/b.txt
.gitignore:1:a/*        a/c.dat

(The directory "a" is not ignored, "a/b.txt" is not ignored, and "a/c.dat" is ignored)

Dulwich:

>>> from dulwich import ignore
>>> from dulwich.repo import Repo
>>> a=Repo('.')
>>> b=ignore.IgnoreFilterManager.from_repo(a)
>>> b.is_ignored('a/')
True
>>> b.is_ignored('a/b.txt')
True
>>> b.is_ignored('a/c.dat')
True

@jelmer
Copy link
Owner

jelmer commented Jul 19, 2017

Ahh, thanks - that makes sense. I'll do some more digging and see if I can resolve this specific case.

@jelmer
Copy link
Owner

jelmer commented Jul 20, 2017

We should be able to prevent regressions, so long as we add test cases for all the corner cases we identify.

I've made a change to fix the last issue you mentioned.

@segevfiner
Copy link
Contributor Author

I didn't find anything else using the script. 😄

(Barring the fact that git check-ignore returns a 0 exit code for paths that are explicitly included using a negative pattern... 😓).

@jelmer
Copy link
Owner

jelmer commented Jul 21, 2017

Fixed the exit code in master. I'll close this bug report, we can always fix other cases later if we notice them.

@jelmer jelmer closed this as completed Jul 21, 2017
@segevfiner
Copy link
Contributor Author

@jelmer Actually I was speaking about the exit code of the standard Git check-ignore command (Not Dulwich's). It is documented as returning 0 when any passed paths are ignored and 1 otherwise. But it also returns 0 when any path is explicitly included using a negative ignore pattern. Smells a bit like a bug in it.

The code you added to Dulwich actually seems to do what you expect only returning 0 when a passed path is really ignored. 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants