Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: add Travis job to check for dead URL links #27267

Closed
wants to merge 4 commits into from

Conversation

richardlau
Copy link
Member

@richardlau richardlau commented Apr 16, 2019

Refs: #27168 (comment)

In the long run we should add tools to prevent this ?
I wonder can https://github.com/JustinBeckwith/linkinator integrate to travis-ci ? (Suggested by @bnb )

Seems easy enough to put together.

This is a proof of concept of running https://github.com/JustinBeckwith/linkinator
on our built HTML docs to find broken URL links. I don't think we want to be
running this for every PR as when I run it locally it checks 900+ links which could
be seen as a denial of service attack if we run it too often.

Also it seems to be flagging some URL's as broken which I appear to be able to
navigate to in web browser (although there does appear to be redirecting going
on). If anyone wants to look into that or adopt this PR then great as I probably
won't be spending much time on it.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot nodejs-github-bot added the build Issues and PRs related to build files or the CI. label Apr 16, 2019
.travis.yml Outdated
- make doc-only
- npm install -g linkinator
script:
- linkinator out/doc/api -r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just scanning all.html is a good idea? Or does that just not make a difference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it only run when the docs are touched? (Like a make target that depends on docs?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the same mechanism that add doc label to PR ?

Copy link
Member Author

@richardlau richardlau Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a change that greps the PR's patch diff for *.md files in doc/api and only runs the scan if grep returns something.

For local files linkinator works on a directory level so to scan, say, just all.html would involve moving files into a temporary directory. If someone has time maybe they can run linkinator on just all.html and see how many links it scans.

@Trott Trott added the wip Issues and PRs that are still a work in progress. label Apr 16, 2019
@gengjiawen
Copy link
Member

Looks like some are false positive: https://travis-ci.com/nodejs/node/jobs/193473667#L688.

@richardlau
Copy link
Member Author

Looks like some are false positive: https://travis-ci.com/nodejs/node/jobs/193473667#L688.

Like I said in the OP:

Also it seems to be flagging some URL's as broken which I appear to be able to
navigate to in web browser (although there does appear to be redirecting going
on).

e.g. https://travis-ci.com/nodejs/node/jobs/193473667#L688

   [404] https://chromedevtools.github.io/devtools-protocol/v8/Debugger
-bash-4.2$ curl --head https://chromedevtools.github.io/devtools-protocol/v8/Debugger
HTTP/1.1 404 Not Found
Server: GitHub.com
Content-Type: text/html; charset=utf-8
ETag: "5cb52d6a-1f0"
Access-Control-Allow-Origin: *
X-GitHub-Request-Id: 2612:7F6A:221DC7:2D2A1F:5CB6B0D2
Content-Length: 496
Accept-Ranges: bytes
Date: Wed, 17 Apr 2019 04:51:31 GMT
Via: 1.1 varnish
Age: 0
Connection: keep-alive
X-Served-By: cache-mdw17362-MDW
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1555476691.380921,VS0,VE29
Vary: Accept-Encoding
X-Fastly-Request-ID: 8ced15e319228e1c9c4666e404a8a66c20f7d9d8

-bash-4.2$
-bash-4.2$ curl https://chromedevtools.github.io/devtools-protocol/v8/Debugger
<!DOCTYPE html><meta charset="utf-8"><title>Chrome DevTools Protocol Viewer (redirecting)</title><script>// Copyright (c) 2016 Rafael Pedicini, licensed under the MIT License
var segmentCount=1,l=window.location;l.replace(l.protocol+'//'+l.hostname+(l.port?':'+l.port:'')+l.pathname.split('/').slice(0,1+segmentCount).join('/')+'/?p=/'+l.pathname.slice(1).split('/').slice(segmentCount).join('/').replace(/&/g,'~and~')+(l.search?'&q='+l.search.slice(1).replace(/&/g,'~and~'):'')+l.hash);</script>-bash-4.2$

It seems very weird for a 404 Not Found response page to be redirecting -- There are specific HTTP response codes for redirection.

@richardlau
Copy link
Member Author

Looks like some are false positive: https://travis-ci.com/nodejs/node/jobs/193473667#L688.

Like I said in the OP:

Also it seems to be flagging some URL's as broken which I appear to be able to
navigate to in web browser (although there does appear to be redirecting going
on).

e.g. https://travis-ci.com/nodejs/node/jobs/193473667#L688

   [404] https://chromedevtools.github.io/devtools-protocol/v8/Debugger
-bash-4.2$ curl --head https://chromedevtools.github.io/devtools-protocol/v8/Debugger
HTTP/1.1 404 Not Found
Server: GitHub.com
Content-Type: text/html; charset=utf-8
ETag: "5cb52d6a-1f0"
Access-Control-Allow-Origin: *
X-GitHub-Request-Id: 2612:7F6A:221DC7:2D2A1F:5CB6B0D2
Content-Length: 496
Accept-Ranges: bytes
Date: Wed, 17 Apr 2019 04:51:31 GMT
Via: 1.1 varnish
Age: 0
Connection: keep-alive
X-Served-By: cache-mdw17362-MDW
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1555476691.380921,VS0,VE29
Vary: Accept-Encoding
X-Fastly-Request-ID: 8ced15e319228e1c9c4666e404a8a66c20f7d9d8

-bash-4.2$
-bash-4.2$ curl https://chromedevtools.github.io/devtools-protocol/v8/Debugger
<!DOCTYPE html><meta charset="utf-8"><title>Chrome DevTools Protocol Viewer (redirecting)</title><script>// Copyright (c) 2016 Rafael Pedicini, licensed under the MIT License
var segmentCount=1,l=window.location;l.replace(l.protocol+'//'+l.hostname+(l.port?':'+l.port:'')+l.pathname.split('/').slice(0,1+segmentCount).join('/')+'/?p=/'+l.pathname.slice(1).split('/').slice(segmentCount).join('/').replace(/&/g,'~and~')+(l.search?'&q='+l.search.slice(1).replace(/&/g,'~and~'):'')+l.hash);</script>-bash-4.2$

It seems very weird for a 404 Not Found response page to be redirecting -- There are specific HTTP response codes for redirection.

JustinBeckwith/linkinator#24

@richardlau richardlau force-pushed the dead-urls-tell-no-tales branch from 3022b98 to 2c15115 Compare April 18, 2019 16:49
- npm install -g linkinator
script:
- if [ "${TRAVIS_PULL_REQUEST}" != "false" ]; then
DOC_FILES=`curl -sL https://github.com/nodejs/node/pull/${TRAVIS_PULL_REQUEST}.patch | grep -o '\bdoc/api/.*\.md\b'`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DOC_FILES=`curl -sL https://github.com/nodejs/node/pull/${TRAVIS_PULL_REQUEST}.patch | grep -o '\bdoc/api/.*\.md\b'`;
DOC_FILES=`git diff --name-only HEAD...$TRAVIS_BRANCH | grep -o '\bdoc/api/.*\.md\b'`;

Refs: https://stackoverflow.com/questions/41145041/list-files-modified-in-a-pull-request-within-travis

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we had trouble with commit message linting before with $TRAVIS_BRANCH which is why we ended up downloading the patch files but I forget the details.

- make doc-only
- npm install -g linkinator
script:
- if [ "${TRAVIS_PULL_REQUEST}" != "false" ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant because of

if: type = pull_request

@refack
Copy link
Contributor

refack commented Apr 18, 2019

Nice idea, but why not run it in our CI? (I'm thinking it might be a good idea anyway to move out the make run-ci target into a script in ./tools/)?
Also as for the trigger; changes are a good trigger, but even existing links can go stale, so an additional croned trigger might be prudent.

@richardlau
Copy link
Member Author

Nice idea, but why not run it in our CI? (I'm thinking it might be a good idea anyway to move out the make run-ci target into a script in ./tools/)?
Also as for the trigger; changes are a good trigger, but even existing links can go stale, so an additional croned trigger might be prudent.

Yeah, it probably makes more sense to run it on our CI. It could be run on the nightlies rather than PR's so that way we won't be hammering remote webservers.

@richardlau richardlau mentioned this pull request May 17, 2019
3 tasks
@richardlau richardlau closed this Aug 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Issues and PRs related to build files or the CI. wip Issues and PRs that are still a work in progress.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants