Verifiably reproducible build artefacts #1269

Closed
joshuagl opened this issue Jan 28, 2021 · 16 comments
Labels
backlog Issues to address with priority for current development goals enhancement

Comments

@joshuagl
Member

Description of issue or feature request:

We can give users of our release artefacts (tarballs and wheels) greater confidence in the integrity of the artefacts and our development processes if they can verify that the artefacts we produced correspond to the signed tagged source code for the release.

We can achieve this through implementing reproducible builds.

Current behavior:

Tarball (sdist) and wheels (bdist_wheel) generated for a release are not verifiably reproducible.

$ diff dist-a dist-b
Binary files dist-a/tuf-0.16.0-py2.py3-none-any.whl and dist-b/tuf-0.16.0-py2.py3-none-any.whl differ
Binary files dist-a/tuf-0.16.0.tar.gz and dist-b/tuf-0.16.0.tar.gz differ

Expected behavior:

Tarball (sdist) and wheels (bdist_wheel) generated for a release are verifiably reproducible.
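The `diff` check shown under "Current behavior" can also be scripted. The following is a minimal sketch, assuming two directories that each hold the output of one build run; the function names `sha256_of` and `differing_artefacts` are hypothetical and not part of python-tuf.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_artefacts(dir_a, dir_b):
    """List file names whose bytes differ, or that exist in only one directory."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    names = {p.name for p in dir_a.iterdir()} | {p.name for p in dir_b.iterdir()}
    differing = []
    for name in sorted(names):
        a, b = dir_a / name, dir_b / name
        if not (a.is_file() and b.is_file()) or sha256_of(a) != sha256_of(b):
            differing.append(name)
    return differing
```

An empty list from `differing_artefacts("dist-a", "dist-b")` would mean the two builds are byte-for-byte identical.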

@joshuagl joshuagl changed the title Reproducible build artefacts Verifiably reproducible build artefacts Jan 28, 2021
@joshuagl
Member Author

Ensuring SOURCE_DATE_EPOCH is set (i.e. this patch) enables us to create verifiably reproducible wheels, but the tarball (sdist) is still not verifiably reproducible.

$ diff dist-r-a dist-r-b
Binary files dist-r-a/tuf-0.16.0.tar.gz and dist-r-b/tuf-0.16.0.tar.gz differ
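To illustrate why a fixed timestamp matters for wheels: a wheel is a zip archive, and every zip entry carries a modification time. The sketch below is not what the wheel tooling does internally; it only shows, under that assumption, that stamping all entries with one fixed epoch makes the archive bytes repeatable. `build_zip` is a hypothetical helper.

```python
import io
import time
import zipfile


def build_zip(files, source_date_epoch):
    """Return zip bytes for {name: bytes}, stamping every entry with one fixed time."""
    # Zip timestamps cannot represent dates before 1980.
    date_time = time.gmtime(source_date_epoch)[:6]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):  # fixed entry order
            info = zipfile.ZipInfo(name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.external_attr = 0o644 << 16  # fixed permissions
            archive.writestr(info, files[name])
    return buffer.getvalue()
```

Calling `build_zip` twice with the same files and the same epoch yields identical bytes on the same machine, whereas a different epoch changes the output.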

@joshuagl
Member Author

Note: I tried this on macOS and Fedora 33. I don't think the sdist being non-deterministic is host tool related.

@mnm678
Contributor

mnm678 commented Jan 28, 2021

tar isn't deterministic by default. Do they still differ when generated with --mtime?

@joshuagl
Member Author

Quite right, thanks for the link! Our sdist tarballs are generated by setuptools, which (so far as I could tell in the relatively brief time I spent looking today) has no option to specify mtime.

It's possible we just have to brute force this; unpack and re-pack the tarball with --mtime, use strip-nondeterminism, or similar.

I intend to spend some more time on this in the next couple of weeks, unless someone gets to it first.
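The "unpack and re-pack" idea could look roughly like the sketch below, which uses only the standard library instead of `tar --mtime` or strip-nondeterminism. It is an assumption about one way to do it, not the approach the project adopted; `repack_normalized` is a hypothetical name, and the sketch only has regular files and directories in mind.

```python
import gzip
import tarfile


def repack_normalized(src_path, dst_path, mtime=0):
    """Rewrite a .tar.gz with sorted members and fixed mtime/owner metadata."""
    with tarfile.open(src_path, "r:gz") as src:
        members = sorted(src.getmembers(), key=lambda m: m.name)
        # mtime=... fixes the timestamp in the gzip header; filename=""
        # keeps the output file name out of that header.
        with open(dst_path, "wb") as raw, \
                gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=mtime) as gz, \
                tarfile.open(fileobj=gz, mode="w") as dst:
            for member in members:
                member.mtime = mtime
                member.uid = member.gid = 0
                member.uname = member.gname = ""
                member.pax_headers = {}  # drop per-member overrides read from src
                fileobj = src.extractfile(member) if member.isfile() else None
                dst.addfile(member, fileobj)
```

Two sdists that differ only in timestamps, ownership or member order should come out byte-identical after `repack_normalized`.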

@sethmlarson

Hello! Dropping in to say it'd be nice if these techniques could be made available to the broader Python community somehow.

@joshuagl
Member Author

joshuagl commented Feb 9, 2021

There have been pieces of work done in multiple places to support this:

I think the right next step for having a reproducible sdist for tuf is to try and get the above changes accepted into setuptools.

@joshuagl
Member Author

joshuagl commented Mar 9, 2021

In #1161, @sechkova discovered flit:

@joshuagl may be interested in flit, which has some support for reproducible builds.

@jku
Member

jku commented Jan 10, 2022

Updating the state since I started looking where we are with this. Wheel build does seem reproducible and I think this is how we want to do it:

# Use latest commit date as epoch
SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct) python3 -m build

The source tarball issue still persists. I have a home grown diff solution though:

import hashlib
import sys
import tarfile

chunk_size = 100000
tar = tarfile.open(sys.argv[1])
content_hash = hashlib.sha256()

for member in tar:
    if not member.isfile():
        continue
    member_file = tar.extractfile(member)
    data = member_file.read(chunk_size)
    while data:
        content_hash.update(data)
        data = member_file.read(chunk_size)

print(content_hash.hexdigest())

that prints a hash for tarball file contents (disregarding owner, group, mtime, etc metadata). It does care about file order.

This is slightly too simplified -- each file should get hashed on its own, and filename should matter -- but that's very close to the process that would work.
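A possible refinement along those lines, hashing each file separately and keying the result by file name, is sketched below. The names `content_hashes` and `same_content` are hypothetical. Unlike the script above, comparing the resulting dictionaries ignores member order, which is a design choice and may or may not be wanted.

```python
import hashlib
import tarfile


def content_hashes(tarball_path):
    """Map each regular file's name in the tarball to the SHA-256 of its content."""
    hashes = {}
    with tarfile.open(tarball_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            digest = hashlib.sha256()
            with tar.extractfile(member) as member_file:
                for chunk in iter(lambda: member_file.read(100000), b""):
                    digest.update(chunk)
            hashes[member.name] = digest.hexdigest()
    return hashes


def same_content(tarball_a, tarball_b):
    """True if both tarballs hold the same file names with identical content."""
    return content_hashes(tarball_a) == content_hashes(tarball_b)
```

Owner, group and mtime metadata are still disregarded, but a renamed or missing file now shows up as a difference.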

@joshuagl
Member Author

Updating the state since I started looking where we are with this. Wheel build does seem reproducible and I think this is how we want to do it:

# Use latest commit date as epoch
SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct) python3 -m build

As recommended by the SOURCE_DATE_EPOCH documentation 👍

The source tarball issue still persists. I have a home grown diff solution though:

import hashlib
import sys
import tarfile

chunk_size = 100000
tar = tarfile.open(sys.argv[1])
content_hash = hashlib.sha256()

for member in tar:
    if not member.isfile():
        continue
    member_file = tar.extractfile(member)
    data = member_file.read(chunk_size)
    while data:
        content_hash.update(data)
        data = member_file.read(chunk_size)

print(content_hash.hexdigest())

that prints a hash for tarball file contents (disregarding owner, group, mtime, etc metadata). It does care about file order.

That's neat. We could write/maintain a little tool to diff tarballs based on their content hashes. Would be nice to help fix this for other Python projects too, but not something we should block python-tuf activities on.

@jku
Member

jku commented Mar 9, 2022

The goal of this work IMO should be a tool verify-release that anyone can run. Tool would:

  • build tarball & wheel locally
  • download tarball & wheel from github
  • download tarball & wheel from pypi using pip
  • verify that all three are the same or complain loudly

As a first step we could add running the tool to the release instructions.
As a second step, we could run the tool on CI regularly (for each release from 1.0 onwards)
A future task that builds on top of the tool would be to actually do the release build in a github deployment action (so maintainer can then verify locally).
As an added feature the tool could check for signatures
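The comparison step of such a tool could be sketched as follows. This is not the eventual verify_release script: building and downloading are left out, and the function is assumed to receive paths to artefacts that are already on disk. `file_digest` and `verify_release` are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the hex SHA-256 digest of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_release(artefacts):
    """Check that all artefacts are identical.

    artefacts maps a source label (e.g. "local", "github", "pypi") to a
    file path. Raises RuntimeError listing each source if digests differ.
    """
    digests = {source: file_digest(path) for source, path in artefacts.items()}
    if len(set(digests.values())) > 1:
        details = ", ".join(f"{s}={d[:12]}" for s, d in sorted(digests.items()))
        raise RuntimeError(f"artefacts differ: {details}")
    return digests
```

Raising an exception is one way to "complain loudly": in CI it fails the job, and a release manager running it locally gets an error naming each source.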

@jku jku added the backlog Issues to address with priority for current development goals label Mar 9, 2022
@joshuagl joshuagl mentioned this issue Mar 9, 2022
3 tasks
@ofek
Contributor

ofek commented Mar 13, 2022

Potential solution: #1896 (comment)

@jku
Member

jku commented Mar 15, 2022

So reproducible builds:

  • ofek has a potential solution for tarball repro in Update package metadata #1896
  • I've got a branch with a verify script (compares local build result to pypi and github release): https://github.com/jku/python-tuf/blob/verify-release/verify_release -- it's a bit rough but something like this should be useful for
    • any developer to verify the release even when release is done manually
    • CI to periodically verify releases (a canary check)
    • after we move to automated CD releases, the release manager can verify the release as it's being made
  • we should at least consider the build tool pinning issue: currently build dependencies (hatchling at least?) are not pinned.
    @ofek do you have opinions about this? I'd like to make sure the build (made from a specific commit) is the same if I run it today or if I run it 6 months from now. I think that suggests we want to pin at least the hatchling version (and plan to bump the pinned version as new hatchling releases are made) but it's possible I'm missing some detail: I'm not very familiar with the python build ecosystem. If we do want to pin hatchling, what about transitive deps (currently editables, pluggy, packaging, pathspec, tomli): bumping those feels like a lot of work...

@lukpueh
Member

lukpueh commented Mar 15, 2022

It works for tarball and wheel repro. Or did you not mention wheels because setting SOURCE_DATE_EPOCH in a deterministic way is all that's needed? Ofek's wheel builder uses a constant, which IMO is just as fine as the latest git commit.

  • I've got a branch with a verify script (compares local build result to pypi and github release):

Cool stuff! Maybe at a later point we can generate in-toto metadata for builds and use in-toto also to verify them. I'll add a comment to #529.

  • we should at least consider the build tool pinning issue: currently build dependencies (hatchling at least?) are not pinned.

Not sure if we need to pin the build dependency. I think it would be enough to keep a record of the build environment. But there doesn't seem to be a canonical way for python projects yet.
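Since there is no canonical format for this, the following is only one conceivable sketch of "keeping a record of the build environment": it dumps the interpreter version and the installed package versions as JSON. `build_environment` is a hypothetical helper, and it would have to run inside the environment that performs the build (an isolated build environment is not visible from the calling interpreter).

```python
import json
import platform
import sys
from importlib import metadata


def build_environment():
    """Collect interpreter and installed-package versions for the current environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    print(json.dumps(build_environment(), indent=2))
```

Such a record could be stored next to the release artefacts so that a later rebuild can recreate the same tool versions.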

@jku
Member

jku commented Mar 15, 2022

did you not mention wheels because setting SOURCE_DATE_EPOCH in a deterministic way is all that's needed

Yes correct, I did not mean to imply 1896 didn't handle this.

Not sure if we need to pin the build dependency. I think it would be enough to keep a record of the build environment.

Yeah that's the other option... I just think it might be easiest to "document" it in pyproject.toml :) This could have some consequences I'm not seeing at the moment though

@ofek
Contributor

ofek commented Mar 18, 2022

we should at least consider the build tool pinning issue [...] @ofek do you have opinions about this? I'd like to make sure the build (made from a specific commit) is the same if I run it today or if I run it 6 months from now

Sure, that makes sense to pin

what about transitive deps (currently editables, pluggy, packaging, pathspec, tomli)

No need IMO

@jku
Member

jku commented Apr 6, 2022

I think I'll close this: build is reproducible, and there is a script to check that. These are the main items here.

Build environment maybe should be pinned, or at least documented in the build artefacts but I think we can handle that as future work after #1550...

@jku jku closed this as completed Apr 6, 2022