-
Notifications
You must be signed in to change notification settings - Fork 8.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci: run spell check in CI, fix remaining issues #4799
Conversation
Interesting. So we're going to have to whitelist all of our variable names and short-hands in here, it looks like. Have you noticed that being a big deal between commits? Or do folks tend to reuse the same shorthands in a given project? If on average people only have to add 0-5 "words" for their variables to the whitelist when they make a PR, then I think it's worth it. I'm still all for trying this, but if it becomes too onerous, we might have to come up with some more rules or patterns on how we can identify things to exclude from the scanner INSIDE a given file type. It's hard for me to mentally visualize what this is going to look like in practice from your description. I get the rough idea from some of your links what the annotations look like, but the whole end-to-end I'm struggling to understand without working through it myself. I think we're just going to have to try it and refine it from there. The additional spelling fixes look correct to me as well. |
I think projects tend to reuse shorthands. And certainly w/in a commit or series. I'm also open to ideas for how to write rules for extra exclusions. e.g. Ignore strings shorter than x characters, or longer than y characters. fwiw, there are 71 two-character (676 possible, 612 not in a dictionary*) and 399 three-character (17576 possible, 16991 not in a dictionary) tokens in the whitelist:
-- collectively, much more of the whitelist is 4+5+6 character sequences Most of the annoying sequences I run into are hashes / base64 encodings (e.g. PKI or data: or git shas). My general response to those is to just ignore the file entirely -- the most common types are either .key/.pem/.crt or things like package.json/package-lock.json. One thing you should do is review the whitelist, in reviewing it just now, I realized that there were a couple of misspellings in it... which I'm addressing by fixing & removing. One example is There's also the odd case of Note that you'll probably see things like Anyway, please feel free to try it for a bit. I'm definitely eager to improve the experience. |
Agreed, yeah probably.
I would immediately gravitate toward regex, but some people hate that.
OK. I'll go read it as fully as I can.
I don't think I'd be that critical, personally. But someone else might care more than me.
It could even change straight up into
Eh, that's OK.
Yes, I think we'll refine it further after it's in. |
argc | ||
args | ||
argv | ||
arigatoo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've always seen arigatoo
as arigatou
given it comes from ありがとう
and と
is usually to
and う
is usually u
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// (hiragana doomo arigatoo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, interesting. I would have written doomo
as doumo
too per ど
-> do
, う
-> u
, も
-> mo
.
BGRA | ||
BHID | ||
bhoj | ||
Bhojwani |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know, it might be nice to have another whitelist file that is just "names of contributors and their aliases" and make the true whitelist be a union of the files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. That's probably doable. If whitelist is a directory instead of a file, I can cat them all together. It'll take a bit to reason through, but ... yeah...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah if it's a directory, I'd have like
whitelist\everythingelse.txt
whitelist\names.txt
whitelist\t.txt
whitelist\n.txt
whitelist\r.txt
whitelist\README.md
<-- to explain WTF the files mean
or whatnot.
DFFF | ||
DFFFu | ||
DFu | ||
DGCZZu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is stuff like DGCZZu
coming from? That feels like nonsense that should be put in its own bucket and not applied universally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, base64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YouTube:
Line 112 in 941a44a
* Windows Terminal Launch: [Terminal "Sizzle Video"](https://www.youtube.com/watch?v=8gw0rXPMMPE&list=PLEHMQNlPj-Jzh9DkNpqipDGCZZuOwrQwR&index=2&t=0s) |
terminal/doc/terminal-v1-roadmap.md
Line 25 in c879cd3
| 2019-05-07 | [Announcement](https://devblogs.microsoft.com/commandline/introducing-windows-terminal/) | Terminal announced & open-sourced ([Build 2019 Terminal session](https://www.youtube.com/watch?v=KMudkRcwjCw), ["Sizzle" video](https://www.youtube.com/watch?v=8gw0rXPMMPE&list=PLEHMQNlPj-Jzh9DkNpqipDGCZZuOwrQwR&index=2&t=0s)) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow that's curious. If we had regex, I'd be inclined to make it ignore all of a URL pattern for regexs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've found plenty of typos in URLs... Some can be fixed, e.g. some places include a numerical ID which means one can change the human readable portion :-), and others can be fixed by e.g. updating the website content...
That said, being able to exclude things like youtube urls does seem useful... I'll have to think about how to deal w/ that.
My imaginary pipeline is something like:
- select files -- excludes.txt
- filter input -- my spelling repo has code for things like this -- specifically to address
data:
URLs (a key source of base64) and general hex tokens -- but I haven't thought through how to do it here. As you can see, I'm using regular expressions, so yes, that's probably what I'd go w/. Again, probably one regexp per line. - parse corpus
- drop words found in dictionary -- currently the dictionary isn't configurable in the action (I intend to support this), and the proposal to add things like curated apis would plug in here
- filter output -- whitelist.txt -- this phase is designed to warn about stale whitelist entries (and includes code to automatically remove entries from it as well as adding new entries).
Would you want multiple whitelist files, or would it be sufficient to have multiple "dictionary" files? -- The distinction is that dictionaries are treated as read-only -- the tool will not warn you about unused items in them. Whereas the tool will complain if you've whitelisted something and it can't find it in the whitelist.
It's easier for me to imagine code that handles multiple dictionaries and a single whitelist (especially if I add the code to handle instance counts). Although if I have to handle comments, then I can probably make things work using some fancy greps.
-- I think I might also drop support for having the whitelists live outside of the current branch, right now I'm still theoretically supporting an older model where the tool retrieves the whitelist+excludes files from remote locations, but that won't work well w/ enumerating files...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've finally implemented regexp support along w/ multiple file support and dictionaries. I've just published it and am letting it chew on this repository....
The lazy approach to generating a dictionary would be to create a repository with just the spellchecker and all the files whose contents you believe are ok, the warning output should be valid input as a dictionary file (instead of as a whitelist file). -- I'll have to document this approach.
If that's too inconvenient, I can think about other ways of handling the problem.
While you might be comfortable checking in a script into your repository, I worry that others might not.
1f14140
to
56d9cfe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've released a newer version of the spell checker, and I've split things out into dictionaries (including Names and Fonts).
I've put together a couple of basic patterns which seem to work reasonably well, and improved my basic templates. In working to regularize these, I spotted a couple of additional typos in the project. (Hello rying
.)
I think this should be good to go.
@@ -1281,7 +1281,7 @@ Pane::SnapSizeResult Pane::_CalcSnappedDimension(const bool widthOrHeight, const | |||
// already snapped or minimum size. | |||
// Arguments: | |||
// - widthOrHeight: if true operates on width, otherwise on height. | |||
// - sizeNode: a layouting node that corresponds to this pane. | |||
// - sizeNode: a layout size node that corresponds to this pane. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling out this commit
@@ -182,7 +182,7 @@ class Pane : public std::enable_shared_from_this<Pane> | |||
}; | |||
|
|||
// Helper structure that builds a (roughly) binary tree corresponding | |||
// to the pane tree. Used for layouting panes with snapped sizes. | |||
// to the pane tree. Used for laying out panes with snapped sizes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling out this commit
@@ -765,7 +765,7 @@ void SCREEN_INFORMATION::SetViewportSize(const COORD* const pcoordSize) | |||
|
|||
const CONSOLE_INFORMATION& gci = ServiceLocator::LocateGlobals().getConsoleInformation(); | |||
|
|||
// If we're in terminal scrolling mode, and we're rying to set the viewport | |||
// If we're in terminal scrolling mode, and we're trying to set the viewport |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was hiding :-o
@@ -307,7 +307,7 @@ const bool RenderData::IsGridLineDrawingAllowed() noexcept | |||
{ | |||
// Otherwise, for compatibility reasons with legacy applications that used the additional CHAR_INFO bits by accident or for their own purposes, | |||
// we must enable grid line drawing only in a DBCS output codepage. (Line drawing historically only worked in DBCS codepages.) | |||
// The only known instance of this is Image for Windows by TeraByte, Inc. (TeryByte Unlimited) which used the bits accidentally and for no purpose | |||
// The only known instance of this is Image for Windows by TeraByte, Inc. (TeraByte Unlimited) which used the bits accidentally and for no purpose |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling this out.
@@ -316,7 +316,7 @@ void ProcessCtrlEvents() | |||
{ | |||
/* | |||
* Status will be non-successful if a process attached to this console | |||
* vetos shutdown. In that case, we don't want to try to kill any more | |||
* vetoes shutdown. In that case, we don't want to try to kill any more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥔. 🥔🥔.
\d+x\d+Logo | ||
Scro\&ll | ||
# selectionInput.cpp | ||
:\\windows\\syste\b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
syste
is otherwise reported as a false positive. This is because of an intentionally narrow example in:
terminal/src/host/selectionInput.cpp
Line 416 in 0f82811
:\windows\syste |
There's no |
Dictionaries can include words that don't appear in the corpus. So, you could, e.g. have the entire English language in a dictionary, or a bunch of unused apis, or a list of employees. You would only whitelist things that are currently used, but which aren't really words -- when they're removed, the tool will complain and suggest removing them from the whitelist. |
Excellent clarification, thank you. @DHowett-MSFT, I'd like to bring this in. Are you okay with that? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this so much. Thanks.
@jsoref I really welcome the spell check. Especially because I'm not a native speaker and thus, I've been the source of quite a few of the findings already. |
@german-one: if you push a branch to this repository (no need to create a PR), it should automatically run (you'll want your branch to be past the merge of course). If you have a fork of this repository, then it might run–you'd go to actions and possibly enable actions. It is possible to run this in docker. The action log gives you the steps to run, it isn't particularly hard. I'll probably work on providing instructions for that soon. Please give a branch a try. |
@jsoref Outcome after resolving a merge conflict: |
@miniksa / @DHowett-MSFT / @german-one : I'm thinking that projects will want to be able to insert custom advice / best practices into the bot message. Offhand, I expect that would be another file, I'm open to advice on naming it (also, where would the best place be to discuss this stuff?). Tentatively, I think
How does that sound (as a feature)? I also need to look into the schedule hook, as it appears to be misbehaving. |
I like the idea of having bot advice! |
<!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? --> ## Summary of the Pull Request Translate automatically generated message text from German into English. <!-- Other than the issue solved, is this relevant to any other issues/existing PRs? --> ## References #4799 <!-- Please review the items on the PR checklist before submitting--> ## PR Checklist * [ ] Closes #xxx * [x] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA * [ ] Tests added/passed * [ ] Requires documentation to be updated * [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx <!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here --> ## Detailed Description of the Pull Request / Additional comments @miniksa mentioned in microsoft/terminal#4799 (comment) > Actually..... we would consider that block to be wrong. It should have had the English error text there. [...] It's just our practice to have everything be in English as that's our company's working language. [...] The translation is based on the message text found in the official docs: https://docs.microsoft.com/en-us/nuget/Consume-Packages/Package-restore-troubleshooting <!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well --> ## Validation Steps Performed
Summary of the Pull Request
This adds an action (currently pinned to my latest release 0.0.10-alpha) which performs spell checking.
Spell checking can occur:
References
PR Checklist
Detailed Description of the Pull Request / Additional comments
Comments are added to commits/pull requests
such as:
jsoref@31d73b0#commitcomment-37624165
Unrecognized words will be annotated
When a new commit is added while the action is active, as in:
check-spelling/examples-testing@54bcdf0
Whitelisting words
The details in the comment provide commands one can run to update the whitelist for the cases where a word should be accepted.
Excluding files
The
excludes.txt
file allows one to exclude files by directory, extension, or other patterns -- each line in the file is a perl regular expression.Scheduled runs
Due to GitHub constraints, PRs from forks don't have permission to comment/annotate, so instead, I'm trying using a schedule to check pull requests. -- Currently configured to look for PRs that have changed in the last hour (running hourly).
Quirks
Roughly doubled action runs
The use of both
push
andpull_request
generally means that for commits w/in a repository, one will get at least duplicate annotations. The comments should appear in two different places (one should end up on the PR and the other should end up on the commit).I haven't figured out the best way to address this, as while one is working on a branch, immediate feedback is best. -- But if one has a PR, I think it's best to be able to see the output in the PR...
Miscellaneous
I've included a couple of additional spelling commits. If the changes are wrong, we can either whitelist the words or add the files to excludes.
Adjustment
I'm happy to take feedback and make adjustments :-)
Validation Steps Performed
This is an alpha release, and this repository would be an early adopter.
We're using a version of this and I'm slowly seeding it to projects.
You can test this code by applying it to your own fork in GitHub and then adding a new commit with a word that isn't in the dictionary ...
cc: @miniksa, @DHowett-MSFT