Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

abort on filesystem loop #273

Merged
merged 1 commit into from
Nov 14, 2014
Merged

abort on filesystem loop #273

merged 1 commit into from
Nov 14, 2014

Conversation

e2
Copy link
Contributor

@e2 e2 commented Nov 13, 2014

@rymai, @thibaudgg - I think this is pretty drastic, so I wanted some feedback before thinking of releasing this.

TL;DR With this patch - Listen aborts (!) if you symlink a directory to another directory within the project

There are 3 scenarios:

  1. normal symlink - when there's a symlink to a directory outside the watched dir, e.g.
    foo/bar -> /baz/bar # outside watched dir

  2. duplicate watched dir - when a symlink points to a directory WITHIN the watched dir

  3. Bad, i.e. filesystem loop (and e.g. find -L also halts on this)
    foo/bar/baz -> ../..

The patch

This patch makes Listen abort on 3 (obviously), but also aborts on 2, which I think is ok, because either the user resolves the ambiguity (see below), or complex resolving would have to be implemented in Listen (pointless waste of time).

Best solution

The "best" solution is for users to set watchdirs in Listen in such a way, that 2 is avoided.

Biggest challenge with symlinks:

Intuitively, users "expect" Listen to do a "REVERSE symlink resolution" and ... return ONLY one path/event!

E.g., given the user edits: project/foo/bar/hello.rb, and project/foo links to -> project/vendor/baz, and listen watches project then what should Listen report as modified?

a) what the user edited? (foo/bar/hello.rb)

b) or what physically changed relative to the watched dir ? (vendor/baz/bar/hello.rb)

Of course, option a) is most "user friendly" and "expected", but it gets hairy if you have multiple symlinks pointing to vendor/baz/bar.

My stance on this

I'm in favor of not letting users watch the same physical directory multiple times (files are ok) because of:

  • complexity
  • ambiguity
  • resources (threads mostly)
  • performance

Of course it would be up to the user to make sure there are no loops or symlinks - and I believe that's how it should be, because Listen cannot "lock" the filesystem for itself or find out what symlink was used when a file changed.

Semver

At the very least I can release a 3.0 of Listen.

What do you think?

@e2
Copy link
Contributor Author

e2 commented Nov 13, 2014

Linked issue: #259

@thibaudgg
Copy link
Member

Sounds fair to me, go for it!

@akerbos
Copy link

akerbos commented Nov 13, 2014

Great! This has been long in the coming.

I don't follow the reasoning here, though:

complex resolving would have to be implemented in Listen (pointless waste of time)

So I thought I'd add my two computer science cents.

Finding duplicates or even cycles in a directed graph (which a file system with symlinks is) is (almost) as fast as exploring that graph; in fact, you can do it along the way. All you need is a boolean flag per (real) file which indicates whether you've been there before.

So this part of the argument is not quite accurate. Is usability really such a concern? Are there real disadvantages to using -- per convention -- always the real path or the shortest path or the "most local" path or ... of a given watched directory/file? You can even collect and store all non-cyclic paths for reporting (not for watching!) and duplicate reports; the overhead incurred is the user's responsibility.

What I'm trying to say is: algorithmically, there is no big issue at all with duplicate symlinks and cycles. It's easy (and efficient) to detect either and filter and/or normalise paths. (In fact, you implicitly do this already with all the paths containing . and ...)

@e2
Copy link
Contributor Author

e2 commented Nov 13, 2014

Finding duplicates or even cycles in a directed graph (which a file system with symlinks is) is (almost) as fast as exploring that graph

The problem isn't detecting duplicates - it's deciding how to easily watch directories recursively (technical constraint) and continously (regardless of dirs being added while scanning) in a way to avoid watching the same directory more than once.

All you need is a boolean flag per (real) file which indicates whether you've been there before.

That's what the patch already does (to know when to abort with an error).

Is usability really such a concern?

Yes, because e.g. Guard uses regexps on relative paths.

Example: technically users expect that changing app/foo.rb triggers the change app/foo.rb (regardless of symlinks) and not vendor/lib/bar/app/foo.rb - if that's where the real file is.

How do you know whether the user is editing app/foo.rb and not vendor/lib/bar/app/foo.rb?

Well, you don't because the filesystem doesn't provide that information (except maybe when polling - which creates multiple events, so problem type is OS-specific).

Are there real disadvantages to using -- per convention -- always the real path or the shortest path or the "most local" path or ... of a given watched directory/file?

Yes. Human beings don't like using redundant and lengthy paths, e.g. I run vi Rakefile, and not vi /home/me/projects/github/whatever/Rakefile every single time. The current work directory is the "context" and anything that starts with the watched dir contains a redundancy.

Those are user expectations - not meeting them means frustration, issues, etc.

The problem is: system events use full paths, so currently I just try to find how the full path is relative to the current one (not ideal, but better than a full path, because at least it's tied to a specific watched directory).

This patch, though, opens up the possibility of transforming a real path back into the original symlink (sic!).

the overhead incurred is the user's responsibility.

This is exactly what I was aiming for - it's the users responsibility to keep their filesystem and directory structure reasonable, and the error makes them aware they are watching more files and directories than necessary.

What I'm trying to say is: algorithmically, there is no big issue at all with duplicate symlinks and cycles.

Algorithmically, the challenge is doing the reverse (having a resolved file and asynchronously discovering/guessing which symlink the event originated from).

It's easy (and efficient) to detect either and filter and/or normalise paths.

Trust me, in Listen it isn't efficient:

  • iteration through Celluloid was painfully slow (seconds) even on small projects (reason for synchronously doing Record#build during startup).
  • Pathname is painfully slow (compared to Dir and File), but will likely become essential because of encoding issues (see other tickets)
  • with this patch, there's now an additional File.realpath every directory gets - just to detect cycles
  • the time spent in mutexes and conditionals would increase due to the calculations

And, it isn't easy:

  • Record#build doesn't have coverage, while what you propose will explode the test cases which lack coverage
  • Record is already complex to be both fast and efficiently store directory information without constantly splitting paths
  • Listen doesn't differentiate between symlinks and real paths at all - there's no place for them in the code (exponential growth of test cases)
  • users don't even differenciate between symlinks and their real counterparts
  • even almost every non-system tool doesn't differentiate - so why should Listen handle cases specifically
  • heck, even find -L reports warnings about cycles, show how these are NOT normal nor expected
  • things get hairy when there are multiple(!) watched dirs unrelated to the current work directory and all you're getting are full paths
  • as for processing paths reliably, there are reasons why Pathname is soooo slow ...

In short, Listen already allows too much and does too much (just presenting my view based on many hours of debugging issues in Listen).

Handling more than one "reference" (dir or symlink) in a watched directory is like expecting an OS to continue working reliable on unreliable, unpredictable and faulty hardware (getting more reliable hardware will almost be a faster/cheaper option).

If only Listen could get away with doing a lot less... ;)

e2 added a commit that referenced this pull request Nov 14, 2014
@e2 e2 merged commit ec79355 into master Nov 14, 2014
@e2 e2 deleted the e2-stop_on_symlink_loop branch November 15, 2014 11:42
@stephan-nordnes-eriksen

One thought: With this commit listen assume that the folder listened to is under the control of the person initiating listen. Eg. Listen is now useless for watching a users directory where the developer has no control of what is going on.

Idea: How about adding a "no symlink" mode which simply ignores symlinks?

@e2
Copy link
Contributor Author

e2 commented Nov 16, 2014

Listen is now useless for watching a users directory where the developer has no control of what is going on.

I assume the user can choose which folders can be watched - in that case, it's just a matter of making a "unique list" of directories - and pass them to Listen.

Of course, this is slightly troublesome to build a "correct" list of directories - and listen could do this for you automatically.

Which is exactly what this feature request is for: #274

(Pull requests are welcome - I'll help answer questions, etc.).

@mockdeep
Copy link

Just ran into this. It would be nice if Listen somehow resolved this instead of just refusing to continue. Is there no way to just not continue following symlinks if it finds a loop? I'd also be fine with it not following symlinks at all, if that's easier.

@e2
Copy link
Contributor Author

e2 commented Nov 19, 2014

@mockdeep - yes, there is a way. It's this feature request:
#274 (PRs are welcome - I'll gladly help with advice, etc. - I'm currently just overwhelmed with other higher priority bugs to fix).

@mockdeep
Copy link

@e2 Ah, hadn't seen that one. Any suggestions on where I should look first if I wanted to poke around?

@e2
Copy link
Contributor Author

e2 commented Nov 19, 2014

Yes! I've added 4 points in the description here: #274

@stephan-nordnes-eriksen

I feel that guard/listen is a bit overcomplicated, so I made a minimalistic replacement for my use case. It has way less features, and is not as accurate, but it gets the job done at a significantly reduced cpu cost, at least on osx.

It might be worth having a look: https://github.com/stephan-nordnes-eriksen/feather_watch

@e2
Copy link
Contributor Author

e2 commented Nov 19, 2014

I feel that guard/listen is a bit overcomplicated

Yes, that's because of lots of trial and error getting everything working:

https://github.com/guard/listen/issues?page=3&q=is%3Aissue+is%3Aclosed

I tried to make it simpler, but unfortunately every bit of code has a strong reason to stay there.

Also, there are many people already relying on how listen works (also very, very old versions), so API compatibility is very important as well - especially since it's used by Guard.

Listen 3.x is still in the pipeline, and one planned feature is to give users a way of getting a "lightweight" setup (using the adapters more directly) and where you could choose the features you're interested in.

Personally, I'd just use rb-inotify directly and never have a single care for OSX or Windows users - because rb-inotify needs the least amount of resources and provides enough functionality to be used alone (WDM comes second place).

As for the file events you're using: https://github.com/thibaudgg/rb-fsevent#fileevents

You may want to checkout vagrant-gatling-rsync, which kind of implements what you need.

and is not as accurate

This is a deal breaker because of editor support, even with rb-inotify: https://github.com/guard/guard/wiki/Analysis-of-inotify-events-for-different-editors. E.g. with some Vim settings, you'll get a file move and a file delete and ... nothing else (you won't detect changes on the actual file saved).

I think nothing frustrates users more than saving a file in an editor which either doesn't trigger a file event, or triggers multiple events.

but it gets the job done at a significantly reduced cpu cost, at least on osx.

OSX is one area I can't help much with - so if you can get editor support working flawlessly on it, I could help adapt Guard to use your library, etc.

@stephan-nordnes-eriksen

No doubt there is a lot of work behind listen. I would be an honour if you decided to use my library within listen. It is really just as close to pure file system events as possible by using existing libraries (rb-inotify, wdm, and rb-fsevent). Not much more.

As for editor support. I am not 100% sure what that implies. Maybe there is some magic that should be added, but I honestly don't know what that should be because you get (as far as I know) all the file system events created, but sometimes you get only 2 if there is multiple in quick succession, the first and the last. On osx, well, I might dig around in rb-fsevent to see if I am amble to append the actual events, as it is the missing link for both listen and feather_watch. Only issue is that it is about 5 lines of ruby, and the rest is magically conjured into existence from some c code.

@e2
Copy link
Contributor Author

e2 commented Nov 27, 2014

I would be an honour if you decided to use my library within listen.

Compatibility is more important for me (since that's where more of the problems and challenges are).

If you are willing to actively support feather_watch, deal with issues, etc., and you strongly believe many people would benefit from using it, I'd suggest you first integrate it with listen-compat.

Ideally, users could just use an environment variable like LISTEN_IMPLEMENTATION=feather_watch and your gem would be loaded and initialized.

Basically, listen-compat is supposed to be an "wrapper" for any available/installed listen version, but that also means it could easily support any library providing the same interface (see README).

Feel free to send pull requests to have it integrated.

It is really just as close to pure file system events as possible by using existing libraries (rb-inotify, wdm, and rb-fsevent)

If I were you, I'd definiately go through all the closed issues in Listen - because there's too much reinventing the wheel in terms of potential problems. Also, the lack of Polling mode is pretty much a deal breaker on virtual or special filesystems or with complex trees with symlinks.

As for editor support. I am not 100% sure what that implies.

It means I save a file, and e.g. Guard, or Sass or Jekyll causes no more and no less than one "changed" event, regardless if the editor uses move_to, atomic save, rename, temporary files, backup files, etc. This also means ignoring editor backup files (Listen has a list of builtin defaults).

Sometimes and editor uses multiple events to "change" a file - this should be reported as one change within the given time interval (see :wait_fo_delay option in Listen).

Maybe there is some magic that should be added

E.g. On Linux, events have a "cookie" option, which says which operations are related. Listen uses this information to "cancel out" events logically, e.g. removing a file and adding it again is translated into a "changed" event, because although the file was removed it exists, and although it was added, it isn't a new file (because it was deleted).

but sometimes you get only 2 if there is multiple in quick succession, the first and the last.

The problem with Listen is not so much performance, as people expect "1 saved file" to generate "1 event".

Personally, when I need performance, I just use the rb-inotify library directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants