Planning the rest of the changes to HTML parsing in `PackageFinder` #10868

pradyunsg · 2022-01-31T20:23:34Z

Okay, I'm looking at the logic now and I think I have a concrete suggestion for how to change things that I'd like feedback from other @pypa/pip-team members on:

In 22.0.x, change the fallback logic again -- dropping the stupid starts-with-doctype check and using html.parser by default, with the error about doctypes changed to a warning, and the relevant issue continuing to tell users to pass --use-deprecated=html5lib if they are hitting a bug in the parser.
In 22.2, drop the --use-deprecated=html5lib while continuing to warn about bad doctypes.

The text was updated successfully, but these errors were encountered:

This reworks the HTML parsing logic, to gracefully use `html5lib` on non-compliant HTML 5 documents. This warning softens the failure mode for users who are using commercial package index solutions that do not follow the requisite standards and serve malformed HTML documents.

pfmoore · 2022-02-01T11:09:54Z

I agree with this approach. I think it's better not to try to do our own parsing of the doctype declaration, just to do the fallback - we'll just get caught up in endless debates about what precisely is allowed - which is exactly what we're trying to avoid by using standard libraries for HTML parsing.

The only real issue we've hit is missing doctypes, and if we downgrade the error to a warning, that resolves those without needing to fall back to html5lib. And as you say, any actual issues with html.parser can be flagged and dealt with as people encounter them, with the workaround being to request the old parser.

We're then just back to a normal deprecation process for html5lib.

sbidoul · 2022-02-01T11:25:30Z

I agree too.

I have a genuine question: what are the downsides of completely dropping the doctype declaration checks (i.e. not even warning about those we don't recognize).

pfmoore · 2022-02-01T12:52:06Z

Probably none, actually. It's a good point - there is a risk here that we're getting entrenched in defending the released behaviour when we don't need to.

I do think that PEP 503 requiring valid HTML5 is reasonable, because it makes writing tools easier in general. And there's an argument that if pip doesn't enforce the standard, no-one will stick to it and we'll end up with "what pip does" being a de facto standard in spite of having an actual standard 🙁 Whether that's enough to justify us being strict, I don't know.

pradyunsg · 2022-02-01T12:57:46Z

I'm fine with dropping the doctype checks as well.

notatallshaw · 2022-02-01T15:11:12Z

I do think that PEP 503 requiring valid HTML5 is reasonable, because it makes writing tools easier in general. And there's an argument that if pip doesn't enforce the standard, no-one will stick to it and we'll end up with "what pip does" being a de facto standard in spite of having an actual standard 🙁 Whether that's enough to justify us being strict, I don't know.

Maybe as a compromise just permanently throw warnings on non-complaint behavior?

I definitely do appreciate pip pushing for standards. I have some musings of writing my own alternative to pip that has way less features but has an API and could be used as a library for other tools.

ppena-LiveData · 2022-02-01T16:14:27Z

I don't think it should be pip's responsibility to validate if a repo is returning valid HTML5 or not. Even just doing HTML5 DOCTYPE validation isn't straightforward/simple. I also feel that violates the Do One Thing and Do It Well principle and the robustness principle. It also seems more Pythonic to just use a try..except to just try using the standard html.parser (w/out a handle_decl()), and if that fails, then falling back to html5lib with a warning if it succeeds.

notatallshaw · 2022-02-01T16:39:01Z

I don't think it should be pip's responsibility to validate if a repo is returning valid HTML5 or not. Even just doing HTML5 DOCTYPE validation isn't straightforward/simple. I also feel that violates the Do One Thing and Do It Well principle and the robustness principle.

Let's be specific rather than using generic adages of software development. For example, what is the one thing here in "Do One Thing and Do It Well"?

My understanding is this code's "One Thing" is to read a Simple Repository API that conforms to PEP 503. In which case this is what PEP 503 says:

Within a repository, the root URL (/ for this PEP which represents the base URL) MUST be a valid HTML5 page with a single anchor element per project in the repository.

"MUST" here is also in bold not just in capitals. It seems to me following the principle of "Do One Thing and Do it Well" of reading a PEP 503 compliant Simple Repository API means you MUST reject non-valid HTML5.

I'm not in favor of that choice, I'm in favor of where the document is close enough to HTML5 but not actually HTML5 to emit a warning but accept it anyway. But I don't agree with appealing to some software adage makes one choice over another correct here.

ppena-LiveData · 2022-02-01T16:45:16Z

The main job of pip is not to validate whether HTML is valid HTML5 or not (which as I pointed out is not a simple/straightforward thing to do). It's main job is managing packages (yes, including reading from a Simple Repository API). That is the "one thing" it should do well, and it should be conservative in what it does with packages and liberal in what it accepts from others.

If someone wants to know if their HTML is valid HTML5, they can use a different package that is meant to do that.

pfmoore · 2022-02-01T16:48:24Z

Please, let's not get into a debate over abstract principles here. It's not helpful, and speaking for myself at least, it won't alter my view (except maybe to annoy me enough to want to do the opposite of whatever is being argued for 🙁)

notatallshaw · 2022-02-01T17:05:15Z

Agreed, the choice here is a practical one not an abstract one. What is the value of checking the API being read is valid vs. the value of not breaking existing indexes? If all indexes followed the spec of the API they are providing we wouldn't be having this discussion.

So assuming Pip provides leniency here and reads a non-valid API then how and where should it be lenient? Doctype check, shape of html structure provided, urls ending in a "/", etc.

That's entirely up to the pip maintainers and I think it's reasonable to accept an API response whose only missing part is the doctype given that's what many indexes apparently do. But I would appreciate if that choice is explicitly being made that it's documented somewhere so it can be referenced by anyone else wishing to implement their own package installer as "this is what Pip does so let's follow that".

pradyunsg · 2022-02-01T17:49:54Z

I think folks here are largely OK with what I noted in OP, so that's what I'm gonna polish the implementation of today evening. That is, keeping the warning for indexes that don't have doctypes as we have in main, but flipping the default to use the newer logic.

I understand the philosophical argument that is being made here, but the person making it IMO is missing that they're talking about pip here -- if pip were designed strictly conforming with the Unix philosophy, it'd likely be 4 or more tools instead.

It's main job is managing packages (yes, including reading from a Simple Repository API)

I can't help but giggle at the oxymoron-like nature of this cherry-picked quote. :)

pradyunsg · 2022-02-01T18:01:35Z

FWIW, @ppena-LiveData I appreciate you chiming in, and I've gone ahead and incorporated the regex you've suggested into #10869.

pfmoore · 2022-02-01T19:50:23Z

But I would appreciate if that choice is explicitly being made that it's documented somewhere so it can be referenced by anyone else wishing to implement their own package installer as "this is what Pip does so let's follow that".

I may be misunderstanding you, but that's precisely what I disagree with. Anyone implementing a package installer should follow PEP 503. That's the standard. If anyone were to suggest that "do what pip does" is a better answer than that, then I'd be arguing strongly for pip to do as much HTML 5 validation as possible, never mind the consequences. We've worked far too hard on interoperability standards to abandon them now in favour of "do what pip does", or indeed any sort of "follow the canonical implementation" approach.

OK, taking a deep breath now.

But I'm a strong -1 on documenting that pip consumes anything but PEP 503 compliant indexes, and we should reserve the right to reject anything that's non-compliant without notice or apology. We won't, but that's because we have no intention of deliberately making life difficult for our users, not because we support non-compliant indexes.

pradyunsg · 2022-02-01T20:07:08Z

I'd be arguing strongly for pip to do as much HTML 5 validation as possible

Well, to be honest, I kinda wanna implement a bunch of this (either myself, or via some third party package) and start spewing out more warnings around this area. :)

I'm basically convinced at this point that unless we do this, most package indexes are not gonna follow the spec.

pfmoore · 2022-02-01T20:25:54Z

@brettcannon has already implemented mousebender, which does this. I think his longer-term goal is to include that in packaging, at which point we'd almost certainly switch to that implementation.

So warnings now could be viewed as preparation for that later transition to using a fully standards-compliant library.

(By the way, I don't know, and the point is that I don't care, how strict mousebender's validation is).

pradyunsg · 2022-02-01T20:30:46Z

Well, it doesn't have any HTML 5 validation logic -- https://github.com/brettcannon/mousebender/blob/main/mousebender/simple.py

It also operates under the assumption that the page is valid and just works with <a> tags following that assumption. :)

brettcannon · 2022-02-01T22:20:58Z

Mousebender just relies on html.parser but I didn't think it was worth trying to write or use anything fancier. Does anyone know the motivation for requiring HTML 5?

the root URL (/ for this PEP which represents the base URL) MUST be a valid HTML5 page with a single anchor element per project in the repository.

My assumption is @dstufft put that there so people knew the page had to be structured so it could be parsed by a compliant HTML parser, not to force everyone to use HTML 5 instead of HTML 4 or something.

Otherwise mousebender is as specific/strict as PEP 503 and the subsequent Simple repo API PEPs specify.

I think his longer-term goal is to include that in packaging, at which point we'd almost certainly switch to that implementation.

I would like to, but I'm not pushing for it right now either as the demand hasn't been there.

dstufft · 2022-02-03T06:39:04Z

I never would have expected "HTML5" to be the cause of any kind of contention ;)

Specifying HTML5 was primarily just to remove any questions around how to parse the content. Primarily to short circuit any "well it works in pip... " logic.

For those who aren't aware of the history, there was a time when PyPI served pages with some commented out HTML, because the parsing in setuptools (and maybe pip too, I don't recall) would break without that. There is also the differences between xhtml and html (xhtml requiring certain tags to be closed, whereas html doesn't, etc) and I wanted to cut off any argument about whether a html or xhtml parser was the correct thing to use.

Largely speaking, it was included to say that you should interpret the responses with an html5 parser, and if a valid html5 parser errors out, then that's a bad API response.

Which brings us to this issue.

The question about whether or not HTML5 requires a doctype is not a simple one to answer (and honestly, if we really wanted to be strict about this, we should amend PEP 503).

The sections that have been linked are intended to be read by people writing HTML5 pages. However there's another section entirely dedicated to how a parser should parse HTML5.

One of the features of HTML5 is not just that it defines how one should parse HTML written as specified in the authoring section, but also that it defines how one should parse HTML that deviates from those guidelines. The entire document is quite long and complex and I won't even begin to pretend that I've read the entire thing. However as best as I can tell, the parsing document in section 13.2.6.4 states that if it encounters anything but a handful of things (for the purposes of this discussion, white space and doctype), then that generates a parser error. However, it's basically up to the parser if it bails out at that point or if it continues on, and if it continues on it will then be in the standards defined "quirks" mode.

So roughly, PEP 503 doesn't really specify enough information to make a decision about what PEP 503 actually requires, but it's certainly within a valid HTML5 parser's remit to decide to bail out if the doc type isn't there. That points to the idea that PEP 503 probably does require it to be there. That being said, I think it's probably not worth getting too worried about a missing doc type.

I suspect that html.parser isn't actually a fully valid HTML5 parser, given it calls perfectly valid HTML5 syntax "invalid" in the documentation (though it mentions it to say that it does work..., but the fact it's calling valid syntax invalid does not inspire confidence that it's fully implementing HTML5 correctly). If that's true, to some extent that means that technically using html.parser is, in theory, making a repository have to "guess" what pip supports, rather than being able to just use anything HTML5 allows, and the de facto spec ends up being the subset of HTML5 that html.parser correctly parses (or at least, correctly parses well enough for pip to get the data it needs out of it without error).

Practically speaking, I think switching to html.parser is fine. I think the chances someone is doing something particularly exciting in their repository responses that html.parser doesn't support is pretty small. Likewise I think not enforcing the doctype existence is fine. While HTML5 does require it (sort of, see above) I don't see a lot of practical benefit (especially given the change likely makes some other technically valid edge cases broken, so it seems weird to be strict here but not elsewhere).

pradyunsg changed the title ~~Dealing with the fallout of the html parsing changes~~ Planning the rest of the changes to HTML parsing in PackageFinder Jan 31, 2022

pradyunsg added the C: finder PackageFinder and index related code label Jan 31, 2022

pradyunsg mentioned this issue Jan 31, 2022

Ignore leading whitespace when checking for HTML doctype #10855

Closed

pradyunsg mentioned this issue Feb 1, 2022

Put html5lib logic behind a flag, unconditionally #10869

Merged

notatallshaw mentioned this issue Feb 1, 2022

Request: Delay removal of deprecated html5lib #10860

Closed

1 task

pradyunsg closed this as completed in #10869 Feb 3, 2022

ppena-LiveData mentioned this issue Feb 3, 2022

further thoughts about pip as an HTML5 reader #10880

Open

1 task

github-actions bot locked as resolved and limited conversation to collaborators Mar 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Planning the rest of the changes to HTML parsing in `PackageFinder` #10868

Planning the rest of the changes to HTML parsing in `PackageFinder` #10868

pradyunsg commented Jan 31, 2022 •

edited

Loading

pfmoore commented Feb 1, 2022

sbidoul commented Feb 1, 2022

pfmoore commented Feb 1, 2022

pradyunsg commented Feb 1, 2022

notatallshaw commented Feb 1, 2022

ppena-LiveData commented Feb 1, 2022

notatallshaw commented Feb 1, 2022 •

edited

Loading

ppena-LiveData commented Feb 1, 2022 •

edited

Loading

pfmoore commented Feb 1, 2022

notatallshaw commented Feb 1, 2022 •

edited

Loading

pradyunsg commented Feb 1, 2022 •

edited

Loading

pradyunsg commented Feb 1, 2022

pfmoore commented Feb 1, 2022

pradyunsg commented Feb 1, 2022

pfmoore commented Feb 1, 2022

pradyunsg commented Feb 1, 2022 •

edited

Loading

brettcannon commented Feb 1, 2022

dstufft commented Feb 3, 2022

Planning the rest of the changes to HTML parsing in PackageFinder #10868

Planning the rest of the changes to HTML parsing in PackageFinder #10868

Comments

pradyunsg commented Jan 31, 2022 • edited Loading

pfmoore commented Feb 1, 2022

sbidoul commented Feb 1, 2022

pfmoore commented Feb 1, 2022

pradyunsg commented Feb 1, 2022

notatallshaw commented Feb 1, 2022

ppena-LiveData commented Feb 1, 2022

notatallshaw commented Feb 1, 2022 • edited Loading

ppena-LiveData commented Feb 1, 2022 • edited Loading

pfmoore commented Feb 1, 2022

notatallshaw commented Feb 1, 2022 • edited Loading

pradyunsg commented Feb 1, 2022 • edited Loading

pradyunsg commented Feb 1, 2022

pfmoore commented Feb 1, 2022

pradyunsg commented Feb 1, 2022

pfmoore commented Feb 1, 2022

pradyunsg commented Feb 1, 2022 • edited Loading

brettcannon commented Feb 1, 2022

dstufft commented Feb 3, 2022

Planning the rest of the changes to HTML parsing in `PackageFinder` #10868

Planning the rest of the changes to HTML parsing in `PackageFinder` #10868

pradyunsg commented Jan 31, 2022 •

edited

Loading

notatallshaw commented Feb 1, 2022 •

edited

Loading

ppena-LiveData commented Feb 1, 2022 •

edited

Loading

notatallshaw commented Feb 1, 2022 •

edited

Loading

pradyunsg commented Feb 1, 2022 •

edited

Loading

pradyunsg commented Feb 1, 2022 •

edited

Loading