-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Planning the rest of the changes to HTML parsing in PackageFinder
#10868
Comments
PackageFinder
This reworks the HTML parsing logic, to gracefully use `html5lib` on non-compliant HTML 5 documents. This warning softens the failure mode for users who are using commercial package index solutions that do not follow the requisite standards and serve malformed HTML documents.
I agree with this approach. I think it's better not to try to do our own parsing of the doctype declaration, just to do the fallback - we'll just get caught up in endless debates about what precisely is allowed - which is exactly what we're trying to avoid by using standard libraries for HTML parsing. The only real issue we've hit is missing doctypes, and if we downgrade the error to a warning, that resolves those without needing to fall back to We're then just back to a normal deprecation process for |
I agree too. I have a genuine question: what are the downsides of completely dropping the doctype declaration checks (i.e. not even warning about those we don't recognize). |
Probably none, actually. It's a good point - there is a risk here that we're getting entrenched in defending the released behaviour when we don't need to. I do think that PEP 503 requiring valid HTML5 is reasonable, because it makes writing tools easier in general. And there's an argument that if pip doesn't enforce the standard, no-one will stick to it and we'll end up with "what pip does" being a de facto standard in spite of having an actual standard 🙁 Whether that's enough to justify us being strict, I don't know. |
I'm fine with dropping the doctype checks as well. |
Maybe as a compromise just permanently throw warnings on non-complaint behavior? I definitely do appreciate pip pushing for standards. I have some musings of writing my own alternative to pip that has way less features but has an API and could be used as a library for other tools. |
I don't think it should be |
Let's be specific rather than using generic adages of software development. For example, what is the one thing here in "Do One Thing and Do It Well"? My understanding is this code's "One Thing" is to read a Simple Repository API that conforms to PEP 503. In which case this is what PEP 503 says:
"MUST" here is also in bold not just in capitals. It seems to me following the principle of "Do One Thing and Do it Well" of reading a PEP 503 compliant Simple Repository API means you MUST reject non-valid HTML5. I'm not in favor of that choice, I'm in favor of where the document is close enough to HTML5 but not actually HTML5 to emit a warning but accept it anyway. But I don't agree with appealing to some software adage makes one choice over another correct here. |
The main job of If someone wants to know if their HTML is valid HTML5, they can use a different package that is meant to do that. |
Please, let's not get into a debate over abstract principles here. It's not helpful, and speaking for myself at least, it won't alter my view (except maybe to annoy me enough to want to do the opposite of whatever is being argued for 🙁) |
Agreed, the choice here is a practical one not an abstract one. What is the value of checking the API being read is valid vs. the value of not breaking existing indexes? If all indexes followed the spec of the API they are providing we wouldn't be having this discussion. So assuming Pip provides leniency here and reads a non-valid API then how and where should it be lenient? Doctype check, shape of html structure provided, urls ending in a "/", etc. That's entirely up to the pip maintainers and I think it's reasonable to accept an API response whose only missing part is the doctype given that's what many indexes apparently do. But I would appreciate if that choice is explicitly being made that it's documented somewhere so it can be referenced by anyone else wishing to implement their own package installer as "this is what Pip does so let's follow that". |
I think folks here are largely OK with what I noted in OP, so that's what I'm gonna polish the implementation of today evening. That is, keeping the warning for indexes that don't have doctypes as we have in main, but flipping the default to use the newer logic. I understand the philosophical argument that is being made here, but the person making it IMO is missing that they're talking about pip here -- if pip were designed strictly conforming with the Unix philosophy, it'd likely be 4 or more tools instead.
I can't help but giggle at the oxymoron-like nature of this cherry-picked quote. :) |
FWIW, @ppena-LiveData I appreciate you chiming in, and I've gone ahead and incorporated the regex you've suggested into #10869. |
I may be misunderstanding you, but that's precisely what I disagree with. Anyone implementing a package installer should follow PEP 503. That's the standard. If anyone were to suggest that "do what pip does" is a better answer than that, then I'd be arguing strongly for pip to do as much HTML 5 validation as possible, never mind the consequences. We've worked far too hard on interoperability standards to abandon them now in favour of "do what pip does", or indeed any sort of "follow the canonical implementation" approach. OK, taking a deep breath now. But I'm a strong -1 on documenting that pip consumes anything but PEP 503 compliant indexes, and we should reserve the right to reject anything that's non-compliant without notice or apology. We won't, but that's because we have no intention of deliberately making life difficult for our users, not because we support non-compliant indexes. |
Well, to be honest, I kinda wanna implement a bunch of this (either myself, or via some third party package) and start spewing out more warnings around this area. :) I'm basically convinced at this point that unless we do this, most package indexes are not gonna follow the spec. |
@brettcannon has already implemented mousebender, which does this. I think his longer-term goal is to include that in So warnings now could be viewed as preparation for that later transition to using a fully standards-compliant library. (By the way, I don't know, and the point is that I don't care, how strict mousebender's validation is). |
Well, it doesn't have any HTML 5 validation logic -- https://github.com/brettcannon/mousebender/blob/main/mousebender/simple.py It also operates under the assumption that the page is valid and just works with |
Mousebender just relies on
My assumption is @dstufft put that there so people knew the page had to be structured so it could be parsed by a compliant HTML parser, not to force everyone to use HTML 5 instead of HTML 4 or something. Otherwise mousebender is as specific/strict as PEP 503 and the subsequent Simple repo API PEPs specify.
I would like to, but I'm not pushing for it right now either as the demand hasn't been there. |
I never would have expected "HTML5" to be the cause of any kind of contention ;) Specifying HTML5 was primarily just to remove any questions around how to parse the content. Primarily to short circuit any "well it works in pip... " logic. For those who aren't aware of the history, there was a time when PyPI served pages with some commented out HTML, because the parsing in setuptools (and maybe pip too, I don't recall) would break without that. There is also the differences between xhtml and html (xhtml requiring certain tags to be closed, whereas html doesn't, etc) and I wanted to cut off any argument about whether a html or xhtml parser was the correct thing to use. Largely speaking, it was included to say that you should interpret the responses with an html5 parser, and if a valid html5 parser errors out, then that's a bad API response. Which brings us to this issue. The question about whether or not HTML5 requires a doctype is not a simple one to answer (and honestly, if we really wanted to be strict about this, we should amend PEP 503). The sections that have been linked are intended to be read by people writing HTML5 pages. However there's another section entirely dedicated to how a parser should parse HTML5. One of the features of HTML5 is not just that it defines how one should parse HTML written as specified in the authoring section, but also that it defines how one should parse HTML that deviates from those guidelines. The entire document is quite long and complex and I won't even begin to pretend that I've read the entire thing. However as best as I can tell, the parsing document in section 13.2.6.4 states that if it encounters anything but a handful of things (for the purposes of this discussion, white space and doctype), then that generates a parser error. However, it's basically up to the parser if it bails out at that point or if it continues on, and if it continues on it will then be in the standards defined "quirks" mode. So roughly, PEP 503 doesn't really specify enough information to make a decision about what PEP 503 actually requires, but it's certainly within a valid HTML5 parser's remit to decide to bail out if the doc type isn't there. That points to the idea that PEP 503 probably does require it to be there. That being said, I think it's probably not worth getting too worried about a missing doc type. I suspect that html.parser isn't actually a fully valid HTML5 parser, given it calls perfectly valid HTML5 syntax "invalid" in the documentation (though it mentions it to say that it does work..., but the fact it's calling valid syntax invalid does not inspire confidence that it's fully implementing HTML5 correctly). If that's true, to some extent that means that technically using html.parser is, in theory, making a repository have to "guess" what pip supports, rather than being able to just use anything HTML5 allows, and the de facto spec ends up being the subset of HTML5 that html.parser correctly parses (or at least, correctly parses well enough for pip to get the data it needs out of it without error). Practically speaking, I think switching to html.parser is fine. I think the chances someone is doing something particularly exciting in their repository responses that html.parser doesn't support is pretty small. Likewise I think not enforcing the doctype existence is fine. While HTML5 does require it (sort of, see above) I don't see a lot of practical benefit (especially given the change likely makes some other technically valid edge cases broken, so it seems weird to be strict here but not elsewhere). |
Okay, I'm looking at the logic now and I think I have a concrete suggestion for how to change things that I'd like feedback from other @pypa/pip-team members on:
html.parser
by default, with the error about doctypes changed to a warning, and the relevant issue continuing to tell users to pass--use-deprecated=html5lib
if they are hitting a bug in the parser.--use-deprecated=html5lib
while continuing to warn about bad doctypes.The text was updated successfully, but these errors were encountered: