
Local folder with subfolders #82

Closed
XhmikosR opened this issue Sep 4, 2019 · 19 comments · Fixed by #218
Labels: enhancement (New feature or request), released

@XhmikosR (Contributor) commented Sep 4, 2019

@JustinBeckwith: I'm trying to add linkinator in the nodejs.org repo.

The file structure there is like this:

build/
  ar/
    index.html
    <more folders>/
      index.html
  ca/
    index.html
    <more folders>/
      index.html
  en/
    index.html
    <more folders>/
      index.html
  <more folders>/
  static/

So, if I do linkinator build/ --recurse it fails because it doesn't see an index.html. If I do linkinator build/en --recurse then I get false positives because linkinator can't find the root.

Is there a workaround or something you could do about it?

Thanks in advance!

@JustinBeckwith (Owner)

Ahh, that's not how --recurse works :) The --recurse flag just signals that we are going to follow links from the top-level scanned page, and go another level deeper. Everything in linkinator assumes the link checking is happening in the context of either a.) a URL you've pointed it to, or b.) a directory with an index.html that sprawls out. It doesn't really know anything about the filesystem.

To get this to work today - you'd have to run linkinator for each directory. Something like:

for d in */ ; do
    npx linkinator --recurse "$d"
done

@XhmikosR (Contributor, Author) commented Sep 8, 2019

The thing is that if I do that then linkinator cannot find the root. So I'm stuck :/

@JustinBeckwith (Owner)

I suppose I'm a little confused. What do you mean by not being able to "find the root"? Effectively - linkinator expects that it starts from a single crawling point, and can follow links to all other links it wants to be checked.

Are you saying that the subdirectories throughout your structure also have independent roots that need to be crawled?

@XhmikosR (Contributor, Author) commented Sep 8, 2019

If I do linkinator build/en --recurse, then linkinator throws errors because /static/ isn't in the en folder, it's one folder up.

Maybe adding a root option could help in such cases.

I'm not sure how I can make linkinator work in this case. Here's a WIP branch https://github.com/XhmikosR/nodejs.org/tree/master-xmr-linkinator

@JustinBeckwith (Owner)

Thank you - I get it now! All linkinator does when you point it at a directory is naively spin up a web server from the CWD to serve requests. You could do something very similar:

npx http-server . &   # run the server in the background so the loop can start
for d in */ ; do
    npx linkinator --recurse "http://localhost:8080/$d"
done

@XhmikosR (Contributor, Author) commented Sep 9, 2019

Thanks for the reply!

The problem with this approach is that it's not cross-platform, unfortunately. Also, http-server is broken at the moment, by the way. I personally use the serve package, but it seems I'm having some issues with it in this case.

That being said, I'll see what I can cook up. I still think a new option telling linkinator where to load files from would be useful, though.

Also, one thing I noticed from a quick test is that linkinator throws errors for <link rel="dns-prefetch"> links.

@XhmikosR (Contributor, Author)

Alright, I managed to set up a CLI solution, but it scans only one locale.

So, I figure we need to use the API, right? I'm not very familiar with it, but it seems linkinator returns a Promise, so my first attempt doesn't work.

If you have any time, here's my WIP branch with the API solution nodejs/nodejs.org@master...XhmikosR:master-xmr-linkinator-2

First, I want to at least land the English-locale CLI scanning in nodejs/nodejs.org#2565 and then see how we can improve things by using the API directly.

@JustinBeckwith (Owner)

Using the API seems like the right thing to do here for sure. Linkinator does return a promise from the check method :) There's a simple example here:
https://github.com/JustinBeckwith/linkinator/#simple-example

@JustinBeckwith (Owner)

Just wanted to check in @XhmikosR - did you end up getting this rolling with the API? I was thinking alternatively - you could create an index page at the root level (mostly for testing purposes) that has a link to all of the per-language root pages, and then scan that page.

@XhmikosR (Contributor, Author) commented Nov 3, 2019

@JustinBeckwith I started a patch one day nodejs/nodejs.org@1cdc841 but never got to tweak it :/

Do not hesitate to PR if you have some time :) We definitely get some false positives, but my plan is to just keep a list of whitelisted domains, which will be better than nothing.

@nkuehn commented Apr 24, 2020

Hi, I have a similar setup with microsites under the root that link to each other. I am currently throwing an index.html with nothing but <a href="/foo"></a> (or several of these) into the root, but it's not really nice. In production the webserver does an HTTP-level redirect to the actual entry page, but that doesn't happen in the local webserver emulation here (how should it know my custom redirect...).

It would be great if this use case were covered by the CLI too, since it's already pretty good, with colored output, grouping, etc.

Conceptually, the issue is about separating the first entrypoint from both the scope of the recursion and the root path of the local webserver.

So what about an optional flag --entrypoint that would allow you to call, e.g.,

linkinator ./output-dir --entrypoint /home

with the effect that the local webserver serves ./output-dir, but the crawler starts at http://localhost:0000/home instead of defaulting to http://localhost:0000/?

(could be a plural form too, taking multiple entrypoints)

Does that make sense? What do you think?

@JustinBeckwith (Owner)

I think I may have finally cracked this nut 😆 The latest version of linkinator (2.6 and up) now supports passing multiple locations, and it supports globbing. So you should be able to do something like this:

$ linkinator '*/index.html' 

This will do a top-level scan (you can of course use --recurse if you want to follow links). The path config flag accepts a string or a string[], so this should work from both the CLI and the API. When you pass multiple paths like this, it will default the server root to process.cwd(). Just to be safe, I added a flag to override that too:

$ linkinator --server-root . "**/index.html"

@XhmikosR @nkuehn want to take this for a spin and let me know how it goes?

@JustinBeckwith JustinBeckwith added enhancement New feature or request and removed needs more info question Further information is requested labels Dec 3, 2020
@JustinBeckwith JustinBeckwith self-assigned this Dec 3, 2020
@nkuehn commented Dec 3, 2020

@JustinBeckwith sounds like a nice solution, I'll give it a try. Thanks for the update!

Is the --recurse crawling shared among the matches of the glob, or does linkinator recurse individually per match (and then maybe check some parts twice if the matches link to each other)?

@JustinBeckwith (Owner)

Any time we check a link, regardless of the top-level root path, it is added to a cache and not checked again. So no double checking :) On the matching logic: the --recurse flag scopes each root path given on the CLI to its own tree of requests. So if a request starts at jbeckwith.com/page/2, during the scan of that tree it will only match links that start with jbeckwith.com/page/2.

@nkuehn commented Dec 3, 2020

Thanks for the clarification! It's a bit embarrassing, but I think I don't understand what the --server-root option really does; I assume I am just projecting what I hope it does.

Given my file layout

...
src/
linkinator.config.json
public/
  website1/
    index.html
    fooContent/
      index.html
  website2/
    index.html
    fooContent/
      index.html
...

when I call npx linkinator public/*/index.html --server-root public, I am getting

[404] http://localhost:5433/public/website1/index.html
[404] http://localhost:5433/public/website2/index.html
...

but had hoped for

[200] http://localhost:5433/website1/index.html
[200] http://localhost:5433/website2/index.html
...

I can work around it by temporarily copying linkinator.config.json into the public build output directory and running linkinator from there without the --server-root option, but since you built it, I would prefer to use the server root.

@JustinBeckwith (Owner)

The paths you pass into linkinator will be relative to the server-root. So you could say:

npx linkinator public/*/index.html --server-root .

Or you could say:

npx linkinator */index.html --server-root public

@nkuehn commented Dec 3, 2020

The latter is what I initially tried, but this is what I get:

$ npx linkinator */index.html --server-root public
[404] http://localhost:5738/public/index.html
[404] http://localhost:5738/websites/index.html

The version output is 2.7.0. ("websites" accidentally also contains an index.html, so the glob catches it.) To me, the output suggests that the glob is not resolved relative to the passed server root.

@github-actions

🎉 This issue has been resolved in version 2.11.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

@nkuehn commented Jan 8, 2021

Thanks, using the server root works now!
