resolving tuf metadata url for Warehouse index url #5

Open
jku opened this issue Jul 15, 2020 · 24 comments
Labels
API This issue relates to Warehouse client API

Comments

@jku
Owner

jku commented Jul 15, 2020

To minimize client configuration, pip should be able to find the "TUF API endpoint" (the metadata directory) using no information other than the index URL defined in pip.conf. This relationship should be part of the Warehouse API promise.

Three choices I can think of:

  1. TUF metadata is at a sibling directory of the index url:
    urllib.parse.urljoin("https://pypi.org/simple/", "../tuf/") # 'https://pypi.org/tuf/'
    urllib.parse.urljoin("https://my-host.com/path/to/simple/", "../tuf/") # 'https://my-host.com/path/to/tuf/'
    urllib.parse.urljoin("https://no-path.com/", "../tuf/") # 'https://no-path.com/tuf/' <-- bug
  2. TUF metadata is at a fixed path on the same host:
    urllib.parse.urljoin("https://pypi.org/simple/", "/tuf/") # 'https://pypi.org/tuf/'
    urllib.parse.urljoin("https://my-host.com/path/to/simple/", "/tuf/") # 'https://my-host.com/tuf/'
    urllib.parse.urljoin("https://no-path.com/", "/tuf/") # 'https://no-path.com/tuf/' <-- bug
  3. I guess there is a third option if there can be 'hidden' directories under the index url:
    urllib.parse.urljoin("https://pypi.org/simple/", ".tuf/") # 'https://pypi.org/simple/.tuf/'

This way the index would be contained and easy to mirror/copy.

I am currently guessing the choice is option 1, with Warehouse implementers advised not to serve the index from the domain root, to avoid the issue noted above.
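For comparison, the three options can be put side by side in a small sketch. The function name and the strategy labels are illustrative only and are not part of pip or Warehouse. The trailing-slash normalisation is an added assumption, because `urljoin` treats the last path segment as a file when the URL does not end in "/".

```python
from urllib.parse import urljoin

# Relative references for the three options above; the labels are made up
# for this sketch.
STRATEGIES = {
    "sibling": "../tuf/",  # option 1
    "fixed": "/tuf/",      # option 2
    "hidden": ".tuf/",     # option 3
}


def tuf_metadata_url(index_url: str, strategy: str) -> str:
    """Resolve the TUF metadata directory for an index URL."""
    # Assumption: index URLs are treated as directories, so add the
    # trailing slash that urljoin needs to resolve relative to them.
    if not index_url.endswith("/"):
        index_url += "/"
    return urljoin(index_url, STRATEGIES[strategy])


for index in (
    "https://pypi.org/simple/",
    "https://my-host.com/path/to/simple/",
    "https://no-path.com/",
):
    for name in STRATEGIES:
        print(f"{index:40} {name:8} {tuf_metadata_url(index, name)}")
```

For an index served from the domain root, the "sibling" and "fixed" strategies both resolve to a directory inside the index, which is the bug flagged in the list.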

@jku added the API label Jul 17, 2020
@jku changed the title from "resolving tuf metadata url for Warehouse instance" to "resolving tuf metadata url for Warehouse index url" Jul 17, 2020
@woodruffw

I think this is ultimately a question for @ewdurbin and the other PyPI admins, but my personal vote is for option 1. Hosting it at a sibling path avoids assuming that every host always has a fixed path available.

@ewdurbin

I agree that a sibling path is appropriate. @dstufft @di?

@dstufft

dstufft commented Aug 12, 2020

I think we probably have to do something like 3 actually? Or we need some way for a repository to indicate where its TUF metadata is found. All of the above options would work for PyPI, but when downstream projects like DevPI get deployed, 2 is completely unworkable because they'll host multiple repositories under a single domain. Likewise 1 won't work for all cases either, because there's no requirement that the API live at /simple/. It could live at /, in which case there is no possibility of a sibling path.

The only thing we know for sure will work if we're doing something hardcoded, is something living under the root of the simple API, which pretty much means some sub directory that isn't a valid package name.

The only other option I can think of is some way to ask a URL where its TUF metadata lives... but that gets complicated with static mirrors like bandersnatch, because pretty much the only thing you can rely on is a statically defined header (so we could do a HEAD request to the root url?) or a well-known static file (but that raises the question: if we have a .tuf-location file that points to where TUF lives, what are we really gaining over just mandating it's .tuf?).

So tl;dr

  1. We can't assume we have access to anything outside the "root" of the /simple/ API.
  2. Whatever we do has to be implementable by a bunch of files sitting on disk served with a web server (we can assume the web server has typical configuration options like adding headers).
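The first constraint can be checked mechanically. The sketch below tests whether a relative reference stays under the index root once resolved. The helper name and the DevPI-style URL are hypothetical and chosen only to illustrate a host that serves several indexes under one domain.

```python
from urllib.parse import urljoin, urlsplit


def stays_under_index(index_url: str, relative_ref: str) -> bool:
    """True if the resolved URL is on the same origin and under the index root."""
    root = urlsplit(index_url)
    resolved = urlsplit(urljoin(index_url, relative_ref))
    same_origin = (resolved.scheme, resolved.netloc) == (root.scheme, root.netloc)
    return same_origin and resolved.path.startswith(root.path)


# Hypothetical layouts: PyPI-style, index at the domain root, and a
# DevPI-style host with one index per user.
indexes = [
    "https://pypi.org/simple/",
    "https://no-path.com/",
    "https://devpi.example/alice/dev/+simple/",
]

for ref in ("../tuf/", "/tuf/", ".tuf/"):
    print(ref, [stays_under_index(index, ref) for index in indexes])
```

Only the `.tuf/` reference stays inside the index for all three layouts. The other two land inside the index only when it is served from the domain root, and there only by accident.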

@jku
Owner Author

jku commented Aug 14, 2020

Likewise 1 won't work for all cases either, because there's no requirement that the api live at /simple/, it could live at /, in which case there is no possibility for a sibling path.

Of course you could advise against hosting at '/' in your re-hosting/mirroring README. There may already be mirrors/instances hosting at '/', but even for those, current functionality would not be broken: they just might not be able to use TUF.

... but I do see your point and have to agree with the following:

The only thing we know for sure will work if we're doing something hardcoded, is something living under the root of the simple API, which pretty much means some sub directory that isn't a valid package name.

Having looked at some client code, that last bit sounds tricky in practice. E.g. .tuf is not a valid name according to PEP 508, but clients have to deal with distributions made before PEP 508 (and have historically been quite laissez-faire about this sort of thing)... In practice that might be fine if we make sure .tuf only contains directories (that then contain the actual metadata files)?

@dstufft

dstufft commented Aug 14, 2020

Yea. If I remember correctly, as long as .tuf/ doesn't return an HTML mimetype, pip will just ignore it.

It would be useful to see what the behavior is for a character that is completely invalid in a package name. I don't remember what it is off the top of my head, but I could imagine doing something like ~tuf/, which is even less likely to collide.
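As a quick check of the naming question, the candidate directory names can be tested against the name pattern given in PEP 508. This only shows what the specification allows. It says nothing about how pip or other clients treat such names in practice, which is the open question in this thread. The function name is illustrative.

```python
import re

# Name pattern from PEP 508, which is to be matched case-insensitively.
PEP508_NAME = re.compile(
    r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", re.IGNORECASE
)


def is_valid_project_name(name: str) -> bool:
    """True if the name is a valid project name according to PEP 508."""
    return PEP508_NAME.match(name) is not None


for candidate in (".tuf", "~tuf", "tuf"):
    print(candidate, is_valid_project_name(candidate))
```

Both `.tuf` and `~tuf` fail the pattern: names must start and end with a letter or digit, and `~` is not allowed anywhere in a name.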

@ewdurbin

ewdurbin commented Aug 14, 2020

Clients SHOULD already be parsing the simple api HTML... maybe a pointer to the TUF metadata location should be part of the HTML <head> somehow?

@jku
Owner Author

jku commented Aug 14, 2020

If I remember correctly, as long as .tuf/ doesn't return an HTML mimetype, pip will just ignore it.

Oh this is very likely true. Good point, I was only thinking of the package name aspect.

@dstufft

dstufft commented Aug 14, 2020

Clients SHOULD already be parsing the simple api HTML... maybe a pointer to the TUF metadata location should be part of the HTML <head> somehow?

Yea, I mentioned something along those lines. It's workable, just kind of weird I think? The way TUF works is we're going to have TUF validate the fetch of the /simple/ page, so we'd do this weird thing where we pull it down, ask it how to validate itself, then go fetch that to validate it. Not the end of the world (I think it's still secure), just kind of awkward.

The other awkward part of that is: which response do we put it on? In theory it makes the most sense on /simple/ itself, but that response is huge and modern clients don't actually fetch that page. So we'd probably want to put it on every page, but I don't think that actually works. Well, it does, but we basically lose TUF's protection on nonexistent packages, since they wouldn't have a response to carry something in the <head>, unless we did something weird like resolving until we find a package that exists, then backtracking and validating all of our responses up to that point. That probably makes it a non-starter (and raises the question of what happens if 100% of the packages don't exist).

So I think if we're using some pointer to where the TUF metadata lives, it would have to be in a single location that a client could fetch before doing resolution. Given the problems with /simple/, that's probably a header on /simple/ so we can do a HEAD request, or some well-known location (we could theoretically make it more generic and do something like .well-known/tuf-meta.json).
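A rough sketch of the header variant follows. The header name `X-TUF-Metadata-Location` is invented for illustration, since nothing in this thread or in any PEP defines it. The `opener` parameter is an assumption added so the lookup can be exercised without network access.

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Hypothetical header name; no specification defines it.
TUF_HEADER = "X-TUF-Metadata-Location"


def discover_tuf_url(index_url, opener=urlopen, timeout=10.0):
    """Ask the index root where its TUF metadata lives via a HEAD request.

    Returns the absolute metadata URL, or None if the header is absent.
    """
    request = Request(index_url, method="HEAD")
    with opener(request, timeout=timeout) as response:
        location = response.headers.get(TUF_HEADER)
    if location is None:
        return None
    # The header value may be relative to the index root or absolute.
    return urljoin(index_url, location)
```

A static mirror would only need its web server configured to add that one header on the index root, which fits the "files on disk plus typical web server options" constraint.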

@jku
Owner Author

jku commented Aug 14, 2020

Making sure we're on the same page: there are two different decisions here:

  • client decides whether to use TUF only based on local data already on the client
  • client tries to update the metadata based on what it gets from the server

So client not finding TUF metadata on server does not mean TUF is disabled: just that the metadata may not get updated.
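The two decisions can be sketched as separate steps. This is a simplified illustration and not the actual pip design: it assumes that "local data" means a trusted `root.json` already stored on the client, and the function names and return strings are made up.

```python
from pathlib import Path


def should_use_tuf(local_metadata_dir: Path) -> bool:
    """Decision 1: made from local data only, never from the server."""
    # Assumption: a locally stored trusted root.json is what enables TUF.
    return (local_metadata_dir / "root.json").is_file()


def plan(local_metadata_dir: Path, server_has_metadata: bool) -> str:
    """Decision 2: whether a metadata refresh is possible."""
    if not should_use_tuf(local_metadata_dir):
        return "install without TUF"
    if server_has_metadata:
        return "refresh metadata, then verify"
    # Missing server metadata does not disable TUF; the local copy
    # simply does not get updated.
    return "verify against existing local metadata"
```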

I don't quite understand what this means:

we basically lose TUF's protection on non existent packages

I plan to only do anything with TUF (even updating metadata) once there is a distribution URL that needs to be downloaded -- this is to avoid refreshing metadata when it's not needed.

@dstufft

dstufft commented Aug 14, 2020

Doesn't accessing /simple/<foo>/ also require invoking TUF?

@ewdurbin

ewdurbin commented Aug 14, 2020

I really like the idea of using .well-known 👍, especially given that it just so happens to not be a valid project name.

@jku
Owner Author

jku commented Aug 14, 2020

Doesn't accessing /simple/<foo>/ also require invoking TUF?

The package index HTML will not be verified by TUF, only the actual distribution files -- this is my understanding, @woodruffw can verify.

@ewdurbin

ewdurbin commented Aug 14, 2020

Upon closer inspection, it is not clear if .well-known is allowed anywhere but off of the root URI... so we are probably breaking spec if it lives at https://pypi.org/simple/.well-known

Edit: It is not. Section 3 states:

Well-known URIs are rooted in the top of the path's hierarchy; they
are not well-known by definition in other parts of the path. For
example, "/.well-known/example" is a well-known URI, whereas
"/foo/.well-known/example" is not.
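The rule quoted above reduces to a simple path check. The helper name below is illustrative.

```python
from urllib.parse import urlsplit


def is_well_known_uri(url: str) -> bool:
    """True only if .well-known is the first segment of the URL's path."""
    return urlsplit(url).path.startswith("/.well-known/")


print(is_well_known_uri("https://pypi.org/.well-known/tuf-meta.json"))
print(is_well_known_uri("https://pypi.org/simple/.well-known/tuf-meta.json"))
```

The first URL passes the check and the second does not, so a file under https://pypi.org/simple/.well-known would not count as a well-known URI in the sense of the quoted section.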

@dstufft

dstufft commented Aug 14, 2020

The package index HTML will not be verified by TUF, only the actual distribution files

I'm pretty sure we lose a significant portion of the security promises of TUF if we do that, unless some other mechanism has been added. I think it's also a deviation from PEP 458 (well, PEP 458 doesn't specify what installers must do, but it does indicate /simple/ pages should be TUF targets as well).

Upon closer inspection, it is not clear if .well-known is allowed anywhere but off of the root URI... so we are probably breaking spec if it lives at https://pypi.org/simple/.well-known

We could resolve that by doing /.well-known/tuf-meta.json, and have that contain a URI template that can be combined with the base url of the repository. That allows templated locations which would still support all of the use cases above, just adding the constraints that the repository must be able to put something at the root URL, and that the location for TUF must be expressible as a URI template.

I dunno, I'm personally a fan of just saying $APIBASE/.tuf/ or $APIBASE/~tuf/, but if we want to do the well known route I still think it's workable.
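To make the well-known route concrete, here is a minimal sketch. The JSON key `metadata_url_template` and the template variable `base` are invented for illustration, since the thread does not define a file format. Only the single `{+base}` expression is expanded here; a real client would need a proper URI template implementation.

```python
import json
from urllib.parse import urljoin


def well_known_document_url(index_url: str) -> str:
    """Locate the document at the host root, whatever path the index uses."""
    return urljoin(index_url, "/.well-known/tuf-meta.json")


def tuf_url_from_document(document: str, index_url: str) -> str:
    """Combine the template from the document with the repository base URL."""
    template = json.loads(document)["metadata_url_template"]  # hypothetical key
    # Minimal expansion: only "{+base}" is supported in this sketch.
    return template.replace("{+base}", index_url)


# Hypothetical document a multi-index host might serve.
document = '{"metadata_url_template": "{+base}.tuf/"}'
index_url = "https://devpi.example/alice/dev/+simple/"

print(well_known_document_url(index_url))
print(tuf_url_from_document(document, index_url))
```

One document at the host root can then serve every index on that host, since each index substitutes its own base URL into the template.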

@ewdurbin

I think that going with well-known is ideal. I think it's a reasonable ask of maintainers of compliant mirrors. Perhaps we should do a very public ask?

Something like "Maintainers of PyPI mirrors! Do you host your mirror at a sub directory like /pypi/ or /simple/? Is serving a file from /.well-known/ not feasible for some reason? Let us know!" from @pypi @thePsf @ThePyPA

@dstufft

dstufft commented Aug 14, 2020

To be clear, looking at https://theupdateframework.com/security/ I think if we're only validating the distribution files, we lose protection against:

  • Rollback attacks
  • Indefinite freeze attacks
  • Mix-and-match attacks

Unless we've started using the TUF metadata instead of the /simple/ metadata for dependency resolution.. but it doesn't sound like that's the case due to

once there is a distribution URL that needs to be downloaded

and it would also be in violation of PEP 458/503.

@dstufft

dstufft commented Aug 14, 2020

I think that going with well-known is ideal. I think it's a reasonable ask of maintainers of compliant mirrors. Perhaps we should do a very public ask?

Something like "Maintainers of PyPI mirrors! Do you host your mirror at a sub directory like /pypi/ or /simple/? Is serving a file from /.well-known/ not feasible for some reason? Let us know!" from @pypi @thePsf @ThePyPA

Should be fine to do that ask. It might also be worthwhile asking cooper and, uh, whoever is maintaining DevPI these days how they feel about that solution.

@jku
Owner Author

jku commented Aug 14, 2020

I'm pretty sure we lose a significant portion of the security promises of TUF if we do that, unless some other mechanism has been added, I think it's also a deviation from PEP 458

You seem to be correct, I've missed that! This is very good to hash out now... I've worked with William's Warehouse branch and I'm pretty sure that does not handle simple indexes at the moment.

I'll spend a bit of time thinking about this (and apparently re-reading the PEP) and get back to you.

@dstufft

dstufft commented Aug 14, 2020

I'll make sure I'm on the call tomorrow in case it's easier to sort it out in a higher bandwidth medium.

@jku
Owner Author

jku commented Aug 14, 2020

I'll make sure I'm on the call tomorrow in case it's easier to sort it out in a higher bandwidth medium.

I might not have been invited to that one: I am not aware of a call... Email is [email protected] in case my presence would be helpful (and if timing works for UTC+3).

@dstufft

dstufft commented Aug 14, 2020

It appears your email address is already on the invite list. The call would be in about 7.5 hours or so?

@jku
Owner Author

jku commented Aug 14, 2020

Huh. I've found the original invite email, it's just not on my calendar... Thanks for mentioning it, I'll be there

@woodruffw

The package index HTML will not be verified by TUF, only the actual distribution files -- this is my understanding, @woodruffw can verify.

This was my plan originally, but on closer reading of the PEP:

When updating bin-n metadata for a consistent snapshot, the snapshot process SHOULD also include any new or updated hashes of simple index pages in the relevant bin-n metadata. Note that, simple index pages may be generated dynamically on API calls, so it is important that their output remains stable throughout the validity of a consistent snapshot.

This is slightly annoying to handle, but shouldn't be impossible. It does, however, substantially increase the fragility of TUF target metadata w/r/t inconsequential changes to the simple index (e.g., in the unlikely event of a small typo or necessary HTML change, we'd need to backfill every single target).

@woodruffw

It also means that the initial TUF repository setup includes another lengthy generation period, where we ask Warehouse to render the simple index for each project and hash it. I also don't think this is a dealbreaker, just something we'll need to account for.
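The fragility mentioned above is easy to demonstrate: any byte-level change to a rendered simple index page changes its hash, so the target metadata for that page would have to be regenerated. The HTML snippets and the helper name below are made up for this sketch, and SHA-256 is assumed as the hash function.

```python
import hashlib


def index_page_digest(html: str) -> str:
    """SHA-256 hex digest of a rendered simple index page."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


# Hypothetical rendered pages that differ only in the <title> text.
before = "<html><head><title>Links for example</title></head><body></body></html>"
after = "<html><head><title>Links for Example</title></head><body></body></html>"

print(index_page_digest(before))
print(index_page_digest(after))
print(index_page_digest(before) == index_page_digest(after))
```

A cosmetic template change like this one would alter the digest of every project page at once, which is the backfill scenario described above.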
