Moving away from TaskCluster #3317

lissyx opened this issue Sep 9, 2020 · 47 comments
Assignees
Labels
ci TaskCluster Infra help wanted

Comments

@lissyx
Collaborator

lissyx commented Sep 9, 2020

TaskCluster is a CI service provided by Mozilla, available both to Firefox development (the Firefox-CI instance) and to the community on GitHub (Community TaskCluster). It is widely used across some Mozilla projects, and it has its own advantages. In our case, control over tasks, over workers for specific needs, and over long build times was easier to achieve by working with the TaskCluster team than by relying on other CI services.

However, this has led to CI code that is very specific to the project, and it has become a source of frustration for non-employees trying to send patches and get involved. Some parts of the CI were “hand-crafted”, and triggering builds and tests requires being a “collaborator” on the GitHub project, which has other implications that make it hard to open up to everyone. In the end, this creates an artificial barrier to contributing to this project; even though we happily run PRs manually, it is still frustrating for everyone. Issue #3228 was an attempt to fix that, but we came to the conclusion that it would be more beneficial for everyone to switch to a well-known CI service and a setup that is less intimidating. While TaskCluster is a great tool and has helped us a lot, we feel its limitations now make it inappropriate for a project that wants to stimulate and enable external contributions.

We would like to take this opportunity to also enable more contributors to hack and own the code related to CI, so discussion is open.

Issues for GitHub Actions:

@DanBmh
Contributor

DanBmh commented Sep 11, 2020

What do you think about GitLab's built-in CI features?

I'm using it for my Jaco-Assistant project and I'm quite happy with it because currently it supports almost all my requirements. The pipeline does linting checks and some code statistics calculation and I'm using it to provide prebuilt container images (You could build and provide the training images from there for example). See my CI setup file here.

There is also an official tutorial for usage with GitHub: https://about.gitlab.com/solutions/github/
And it's free for open source projects.

@lissyx
Collaborator Author

lissyx commented Sep 11, 2020

What do you think about GitLab's built-in CI features?

That would mean moving to GitLab, which raises other questions. I don't have experience with their CI, even though I use GitLab for some personal projects (from gitorious.org).

Maybe I should post a detailed explanation of our usage of TaskCluster to help there?

@DanBmh
Contributor

DanBmh commented Sep 12, 2020

That would mean moving to gitlab

No, you can use it with github too.


From: https://docs.gitlab.com/ee/ci/ci_cd_for_external_repos/

Instead of moving your entire project to GitLab, you can connect your external repository to get the benefits of GitLab CI/CD.

Connecting an external repository will set up repository mirroring and create a lightweight project with issues, merge requests, wiki, and snippets disabled. These features can be re-enabled later.

To connect to an external repository:

    From your GitLab dashboard, click New project.
    Switch to the CI/CD for external repo tab.
    Choose GitHub or Repo by URL.
    The next steps are similar to the import flow. 

Maybe I should post a detailed explanation of our usage of TaskCluster to help there?

I think this is a good idea. But you should be able to do everything on GitLab CI as long as you can run it in a Docker container without special flags.

@lissyx
Collaborator Author

lissyx commented Sep 14, 2020

in a docker container

We also need support for Windows, macOS and iOS, which cannot be covered by Docker.

@lissyx
Collaborator Author

lissyx commented Sep 14, 2020

Our current usage of TaskCluster:

We leverage the current features:

  • building a graph of tasks with dependencies: https://github.com/mozilla/DeepSpeech/blob/master/taskcluster/tc-decision.py
  • artifact with indexes: https://community-tc.services.mozilla.com/tasks/index/project.deepspeech
  • building multiple archs:
    • linux/amd64 (via docker-worker)
    • linux/aarch64 (cross-compilation, docker-worker)
    • linux/rpi3 (cross-compilation, docker-worker)
    • android/armv7 (cross-compilation, docker-worker)
    • android/aarch64 (cross-compilation, docker-worker)
    • macOS/amd64 (native, generic-worker, deepspeech-specific hardware deployment)
    • iOS/x86_64 (native, reusing the macOS infra)
    • iOS/aarch64 (native, reusing the macOS infra)
    • Windows/amd64 (native, generic-worker, deepspeech pool managed by taskcluster team)
  • testing on multiple archs:
    • linux/amd64 (docker-worker)
    • linux/aarch64 (native, deepspeech specific hardware, docker-worker)
    • linux/rpi3 (native, deepspeech specific hardware, docker-worker)
    • android/armv7 (docker-worker + nested virt)
    • android/aarch64 (docker-worker + nested virt)
    • macOS/amd64 (native, deepspeech specific hardware deployment, generic-worker)
    • iOS/x86_64 (native, reusing macOS infra)
    • Windows/amd64 (native, generic-worker, deepspeech pool managed by taskcluster team)
    • Windows/CUDA (native, generic-worker with NVIDIA GPU, deepspeech pool managed by taskcluster team)
  • Documentation on ReadTheDocs + Github webhook to generate on PR/push/tag
  • Pushing to repos:
    • Docker Hub via CircleCI
    • Everything else via scriptworker instance running on Heroku:
      • NPM
      • Pypi
      • Nuget
      • JCenter
      • Github

Hardware:

  • Set of GCP VMs for Linux+Android builds/tests
  • Set of AWS VMs for Windows builds/tests
  • 4x MacBook Pro for macOS setups, with VMware Fusion and sets of build/test VMs configured
  • ARM hardware self-hosted:
    • 6x LePotato boards for Linux/Aarch64 tests
    • 6x RPi3 boards for Linux/ARMv7 tests

tc-decision.py is in charge of building the whole graph of tasks describing a PR or a Push/Tag:

  • PRs run tests
  • Pushes run builds
  • Tags run builds + uploads to repositories
  • YAML description files in taskcluster/*.yml to describe tasks
  • dependencies between tasks based on .yml filename (without .yml)
  • decision task created by .taskcluster.yml (canonical entry point of the TaskCluster / GitHub integration) + taskcluster/tc-schedule.sh
  • https://community-tc.services.mozilla.com/docs
  • LC_ALL=C GITHUB_EVENT="pull_request.synchronize" TASK_ID="aa" GITHUB_HEAD_BRANCHORTAG="branchName" GITHUB_HEAD_REF="refs/heads/branchName" GITHUB_HEAD_BRANCH="branchName" GITHUB_HEAD_REPO_URL="aa" GITHUB_HEAD_SHA="a" GITHUB_HEAD_USER="a" GITHUB_HEAD_USER_EMAIL="a" python3 taskcluster/tc-decision.py --dry
  • LC_ALL=C GITHUB_EVENT="tag" TASK_ID="aa" GITHUB_HEAD_BRANCHORTAG="branchName" GITHUB_HEAD_REF="refs/heads/branchName" GITHUB_HEAD_BRANCH="branchName" GITHUB_HEAD_REPO_URL="aa" GITHUB_HEAD_SHA="a" GITHUB_HEAD_USER="a" GITHUB_HEAD_USER_EMAIL="a" python3 taskcluster/tc-decision.py --dry
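
The filename-based dependency scheme described above can be sketched roughly as follows. This is not the actual tc-decision.py logic: apart from `tf_linux-amd64-cpu-opt`, the task names are made up, and the `requires` field is an assumed name used only for this illustration. It also assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical task definitions, keyed by YAML filename without ".yml".
# The "requires" key is an assumed field name for this illustration.
TASKS = {
    "tf_linux-amd64-cpu-opt": {"requires": []},
    "linux-amd64-cpu-opt": {"requires": ["tf_linux-amd64-cpu-opt"]},
    "test-python_36-linux-amd64-opt": {"requires": ["linux-amd64-cpu-opt"]},
}


def schedule(tasks):
    """Return task names in an order that respects their dependencies."""
    graph = {name: set(spec["requires"]) for name, spec in tasks.items()}
    # static_order() raises graphlib.CycleError if the dependencies loop.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for name in schedule(TASKS):
        print(name)
```

Any replacement CI would need some equivalent of this ordering step, whether expressed in code or in the service's own job-dependency syntax.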

Execution encapsulated within bash scripts:

  • Only bash for ease of hacking
  • Re-usable across all platforms (Linux, macOS, Windows) whereas Docker would cover only Linux
  • TensorFlow build:
    • tf_tc-setup.sh : perform setup steps for TensorFlow builds (install Bazel, CUDA, etc.)
    • tf_tc-build.sh: perform build of TensorFlow
    • tf_tc-package.sh: package the TensorFlow build dir as home.tar.xz for re-use
    • exact re-use of tensorflow is required for Bazel to properly re-use its caching
  • DeepSpeech build
    • same architecture, spread over:
    • taskcluster/tc-all-utils.sh
    • taskcluster/tc-all-vars.sh
    • taskcluster/tc-android-utils.sh
    • taskcluster/tc-asserts.sh
    • taskcluster/tc-build-utils.sh
    • taskcluster/tc-dotnet-utils.sh
    • taskcluster/tc-node-utils.sh
    • taskcluster/tc-package.sh
    • taskcluster/tc-py-utils.sh

@opensorceror

I have been using GitLab CI (the on-prem community edition) for about three years at my workplace, and so far I have been very happy with it. @lissyx I believe GitLab CI supports all the requirements you listed above - I've personally used most of those features.

The thing I really like about GitLab CI is that it seems to be a very important feature for the company - they release updates frequently.

@lissyx
Collaborator Author

lissyx commented Oct 30, 2020

@lissyx I believe GitLab CI supports all the requirements you listed above - I've personally used most of those features.

Don't hesitate if you want, I'd be happy to see how you can do macOS or Windows builds / tests.

@DanBmh
Contributor

DanBmh commented Oct 30, 2020

Windows builds might be covered with some of their beta features:
https://about.gitlab.com/blog/2020/01/21/windows-shared-runner-beta/

For iOS I think you would need to create your own runners on the macbooks and link them to the CI. They made a blog post for this:
https://about.gitlab.com/blog/2016/03/10/setting-up-gitlab-ci-for-ios-projects/

@lissyx
Collaborator Author

lissyx commented Oct 30, 2020

Windows builds might be covered with some of their beta features:
https://about.gitlab.com/blog/2020/01/21/windows-shared-runner-beta/

For iOS I think you would need to create your own runners on the macbooks and link them to the CI. They made a blog post for this:
https://about.gitlab.com/blog/2016/03/10/setting-up-gitlab-ci-for-ios-projects/

I have no time to take a look at that, sadly.

@lissyx
Collaborator Author

lissyx commented Nov 3, 2020

@DanBmh @opensorceror Let me be super-clear: what you shared looks very interesting, but I have no time to dig into that myself. If you guys are willing, please go ahead. One thing I should add is that for macOS, we would really need something to be hosted: the biggest pain was on maintaining this. If we move to GitLab CI but there is still need to babysit those, it's not really worth the effort.

@opensorceror

Personally I'm a bit hesitant to work on this by myself, because the CI config of this repo seems too complex for a lone newcomer to tackle.

FWIW, I did a test connecting a GitHub repo with GitLab CI...works pretty well.

I'm not sure where we would find hosted macOS options though.

@lissyx
Collaborator Author

lissyx commented Nov 3, 2020

Personally I'm a bit hesitant to work on this by myself, because the CI config of this repo seems too complex for a lone newcomer to tackle.

Of course

FWIW, I did a test connecting a GitHub repo with GitLab CI...works pretty well.

That's nice, i will have a look.

I'm not sure where we would find hosted macOS options though.

That might be the biggest pain point.

@opensorceror

Looks like Travis supports macOS builds.

Never used it though, not aware of the limitations if any.

@lissyx
Collaborator Author

lissyx commented Nov 5, 2020

FWIW, I did a test connecting a GitHub repo with GitLab CI...works pretty well.

Can it do something like we do with TC, i.e., precompile bits and fetch them at need?
This is super-important, because when you have to rebuild TensorFlow with CUDA, we're talking about hours even on decent systems.

So to overcome this, we have https://github.com/mozilla/DeepSpeech/blob/master/taskcluster/generic_tc_caching-linux-opt-base.tyml + e.g., https://github.com/mozilla/DeepSpeech/blob/master/taskcluster/tf_linux-amd64-cpu-opt.yml

It basically:

  • does a setup + bazel build step on TensorFlow with the parameters we need
  • produces a tar we can re-use later
  • stores it on the TaskCluster index infrastructure

Which allows us to have caching we can periodically update, as you can see there: https://github.com/mozilla/DeepSpeech/blob/master/taskcluster/.shared.yml#L186-L260

We use the same mechanisms for many components (SWIG, pyenv, homebrew, etc.) to make sure we can keep build time decent on PRs (~10-20min of build more or less, ~2min for tests) so that a PR can complete under 30-60 mins.
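
The "build once, index, re-use" idea can be sketched as below. This is only an illustration under assumptions: the `index` is a plain dict standing in for whatever artifact store is used, and `cache_key` and `get_or_build` are hypothetical helper names, not part of TaskCluster or of the DeepSpeech scripts.

```python
import hashlib
import json


def cache_key(component, params):
    """Derive a stable index key from a component name and build parameters."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"{component}.{digest}"


def get_or_build(index, component, params, build):
    """Reuse an indexed artifact when present, otherwise build and store it."""
    key = cache_key(component, params)
    if key not in index:
        # Cache miss: run the expensive build once and record the result.
        index[key] = build(params)
    return index[key]
```

Keying on the build parameters means that changing a flag (for example enabling CUDA) produces a different key and triggers a fresh build, while unchanged parameters hit the cache.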

@DanBmh
Contributor

DanBmh commented Nov 6, 2020

That would be possible, it's also called artifacts in GitLab. You should be able to run the job periodically or only if certain files changed in the repo.

I'm doing something similar here, saving the following image, which I later use in my readme.

@lissyx
Collaborator Author

lissyx commented Nov 6, 2020

That would be possible, it's also called artifacts in GitLab. You should be able to run the job periodically or only if certain files changed in the repo.

I'm doing something similar here, saving the following image, which I later use in my readme.

Nice, and can those be indexed like what TaskCluster has?

@DanBmh
Contributor

DanBmh commented Nov 7, 2020

Can those be indexed like what TaskCluster has?

Not sure what you mean by this. You can give them custom names or save folders depending on your branch names for example, if this is what you mean.

@lissyx
Collaborator Author

lissyx commented Nov 12, 2020

Can those be indexed like what TaskCluster has?

Not sure what you mean by this. You can give them custom names or save folders depending on your branch names for example, if this is what you mean.

Ok, I think I will try and use GitLab CI on gitlab for a pet-project of mine that lacks CI :), that will help me get a grasp of the landscape.

@lissyx
Collaborator Author

lissyx commented Nov 20, 2020

@DanBmh @opensorceror I have been able to play with a small project of mine with GitLab CI, and I have to admit after scratching the surface, it seems to be nice. I'm pretty sure we can replicate the same things, but obviously it requires rework of the CI handling.

However, I doubt this can work well on a "free tier plan", so I think if there's a move in that direction it will require some investment, including to get support for Windows and macOS. We have been able to get access to our current TaskCluster cost usage, and thanks to the latest optimization we landed back in August, we could run the same workload as previously for a fairly small amount of money.

I guess it's mostly a question of people stepping up and doing, at some point :)

@stepkillah
Collaborator

stepkillah commented Dec 2, 2020

@lissyx you can also look into azure pipelines, it has a free tier and self-hosted agents that can be run locally

@lissyx
Collaborator Author

lissyx commented Dec 2, 2020

@lissyx you can also look into azure pipelines, it has a free tier and self-hosted agents that can be run locally

Thanks, but I'm sorry I can't spend more time than I already spent, I'm not 100% anymore on DeepSpeech, and I have been spending too much time on it in the past weeks.

@lissyx
Collaborator Author

lissyx commented Mar 3, 2021

@DanBmh @opensorceror @stepkillah Do you know something that would allow us to have beefy managed macOS (and Windows) instances on GitLab CI ? After a few weeks of hacking over there, I'm afraid we'd be exactly in the same position as we are today with TaskCluster, with the big difference that we know taskcluster, and we are still in direct contact with the people managing it so fixing issues is quite simple for us.

I insist on beefy, because building tensorflow on the machines we have (MacBook Pro circa 2017, running several VMs) even on bare-metal already takes hours. Now we have some caching in place everywhere to limit the impact, but even brew needs).

@opensorceror

@DanBmh @opensorceror @stepkillah Do you know something that would allow us to have beefy managed macOS (and Windows) instances on GitLab CI ? After a few weeks of hacking over there, I'm afraid we'd be exactly in the same position as we are today with TaskCluster, with the big difference that we know taskcluster, and we are still in direct contact with the people managing it so fixing issues is quite simple for us.

I insist on beefy, because building tensorflow on the machines we have (MacBook Pro circa 2017, running several VMs) even on bare-metal already takes hours. Now we have some caching in place everywhere to limit the impact, but even brew needs).

Define "beefy".

@DanBmh
Contributor

DanBmh commented Mar 3, 2021

Could you please remind me, what was the reason for building TensorFlow ourselves?

If building TensorFlow is really that complicated and time-consuming, wouldn't using a prebuilt version for all GPU devices and the tflite runtime (optionally with a non-quantized model) for all other devices be an easier option?

@lissyx
Collaborator Author

lissyx commented Mar 3, 2021

@DanBmh @opensorceror @stepkillah Do you know something that would allow us to have beefy managed macOS (and Windows) instances on GitLab CI ? After a few weeks of hacking over there, I'm afraid we'd be exactly in the same position as we are today with TaskCluster, with the big difference that we know taskcluster, and we are still in direct contact with the people managing it so fixing issues is quite simple for us.
I insist on beefy, because building tensorflow on the machines we have (MacBook Pro circa 2017, running several VMs) even on bare-metal already takes hours. Now we have some caching in place everywhere to limit the impact, but even brew needs).

Define "beefy".

At least 8GB of RAM, preferably 16GB, and at least 8 CPUs.

Could you please remind me, what was the reason for building TensorFlow ourselves?

If building TensorFlow is really that complicated and time-consuming, wouldn't using a prebuilt version for all GPU devices and the tflite runtime (optionally with a non-quantized model) for all other devices be an easier option?

libdeepspeech.so statically links TensorFlow, plus we need to have some patches.

We already have some prebuilding in place on TaskCluster, but the time needed to produce this artifact varies by platform:

  • ~3h on our current macOS infra
  • ~20min on our current linux builders

So each time we work on TensorFlow (upgrading to newer releases, etc.), it's "complicated". Currently, what we achieve is "sustainable", although painful. However, given the performance I could observe on GitLab CI / AppVeyor, it is quite possible that our build times would skyrocket even further, which would significantly slow things down.

@reuben
Contributor

reuben commented Mar 3, 2021

Wouldn't the same then also be true for Windows?

Yes. But Windows servers, although rare, are at least a thing, and the platform is not as hard to support as macOS. Official TensorFlow builds on macOS don't have GPU support anymore, so I don't see how doing the work to move to them would be beneficial.

@reuben
Contributor

reuben commented Mar 10, 2021

I've been taking a look at GitHub Actions lately and it seems like a good fit:

  • Free hosted Linux, macOS and Windows workers for open source projects (allows us to move away from self-managed macOS workers right away)
  • Free unlimited package storage for OSS as well which can be used for artifact caching
  • Supports self managed workers for specific hardware or platform needs
  • Attached to the repository so makes onboarding new maintainers easier (and detached from eg. Mozilla IT)

The biggest caveat seems to be that the self-managed workers don't have a really good security story for public repositories, to avoid random PRs with new CI code exploiting your infra. The best solution to that problem seems to be an approach based on the idea detailed here:

My use case would be running hardware tests using Github Actions, the only workaround I can see right now is using https://github.blog/changelog/2020-12-15-github-actions-environments-environment-protection-rules-and-environment-secrets-beta/ to have a protected environment. But these approvals depend on the service where you deploy rejecting the request if someone makes a PR where the environment section is removed from the workflow.

You could use this by having the runner under a restricted user and use the environment secrets to elevate its access, but you would have to make sure they can't do anything malicious under the restricted user which is probably very hard.

The main questions about porting from TaskCluster to GitHub Actions from my looking seem to be:

  • How to replicate our build caching mechanism, especially for TensorFlow builds
    • I think using Docker caching pointing at GitHub Packages might be the easiest solution here, meaning we can eliminate our own caching setup for Linux and Windows jobs. Still need to implement a solution for macOS workers (or do we? If we go ahead with RFC: Drop support for full TensorFlow runtime on macOS #3550 is the TFLite build fast enough to just do it every time?)
  • We have some dependencies from the local build system to TaskCluster artifacts (such as ds-swig), as well as things like util/taskcluster.py. GitHub has an artifacts REST API that we could use to reimplement the logic, but I guess this would be a good opportunity to re-evaluate the need for these hosted artifacts on a case by case basis. For stuff like, download a native_client.tar.xz package from the CLI as opposed to grabbing a link from GitHub Releases, I feel like it's not worth the time investment to port it. But the cases need to be listed out and evaluated.
  • Understanding better how dependencies work across jobs, and in particular across jobs from different workflows (can we even have multiple workflows for the "CI task group"?). The docs mention a limitation of a maximum of 256 jobs spawned in a job matrix for a single workflow, but it's not clear to me whether that applies only when you use the matrix syntax, whether it's per-matrix, or whether it's a global limit per workflow.
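
To get a feel for how close a job matrix might come to the 256-job figure mentioned above, the expansion can be counted with a few lines of Python. The axes below are hypothetical and only illustrate the arithmetic; they are not the project's actual matrix.

```python
from itertools import product

MATRIX_JOB_LIMIT = 256  # figure quoted from the docs in the comment above


def expand_matrix(matrix):
    """Expand a dict of axis -> values into one dict per job combination."""
    axes = sorted(matrix)
    return [dict(zip(axes, combo)) for combo in product(*(matrix[a] for a in axes))]


# Hypothetical axes; the real set of platforms and versions may differ.
matrix = {
    "os": ["linux", "macos", "windows"],
    "python": ["3.6", "3.7", "3.8", "3.9"],
    "runtime": ["tf", "tflite"],
}

jobs = expand_matrix(matrix)
print(len(jobs), "jobs; within limit:", len(jobs) <= MATRIX_JOB_LIMIT)
```

Since the count is the product of the axis sizes, adding one more axis with a handful of values multiplies the total quickly.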

@reuben
Contributor

reuben commented Mar 10, 2021

Some useful things I found in the process:

  • Tool to run workflows locally: https://github.com/nektos/act
  • We can actually do proper manylinux Python package builds using the manylinux_2_24 images (Debian 9)
  • If GitHub Packages Docker registry caching is a viable replacement for our TC artifact + index caching system, we can probably use multi-stage builds (possibly across multiple Dockerfiles) to cleanly separate different levels of needed caching to have the most optimal build times. For example, tasks to build the language binding packages only need a couple of artifacts from the build: libdeepspeech.so and deepspeech.h. With a multi-stage Docker setup you can carve out an image with just those two files and cache it side-by-side with the full image with all build artifacts.

@reuben
Contributor

reuben commented Mar 10, 2021

On a specific hardware requirement we have: KVM-enabled VMs. GitHub hosted workers don't have it, but Cirrus CI provides some level of free support for OSS and has KVM enabled workers: https://cirrus-ci.org/guide/linux/#kvm-enabled-privileged-containers

Could be something to look into. Another more exotic possibility I read somewhere is running Android emulator task on macOS hosts. I don't know if that would work on GitHub workers tho and it can also be a net negative due to macOS tasks being harder to maintain than Linux ones.

@reuben
Contributor

reuben commented Mar 10, 2021

Posting for posterity here a multi-stage Dockerfile I've been playing with to build Python wheels and also start playing with the caching situation. It's written for our coqui-ai/STT fork but the only difference is repo and artifact names.

FROM quay.io/pypa/manylinux_2_24_x86_64 as base
RUN git clone https://github.com/coqui-ai/STT.git STT
WORKDIR /STT
RUN git submodule sync tensorflow/
RUN git submodule update --init tensorflow/

FROM base as tfbuild
RUN curl -L https://github.com/bazelbuild/bazelisk/releases/download/v1.7.5/bazelisk-linux-amd64 > /usr/local/bin/bazel && chmod +x /usr/local/bin/bazel
WORKDIR /STT/tensorflow/
ENV VIRTUAL_ENV=/tmp/cp36-cp36m-venv
RUN /opt/python/cp36-cp36m/bin/python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
ENV TF_ENABLE_XLA=0
ENV TF_NEED_JEMALLOC=1
ENV TF_NEED_OPENCL_SYCL=0
ENV TF_NEED_MKL=0
ENV TF_NEED_VERBS=0
ENV TF_NEED_MPI=0
ENV TF_NEED_IGNITE=0
ENV TF_NEED_GDR=0
ENV TF_NEED_NGRAPH=0
ENV TF_DOWNLOAD_CLANG=0
ENV TF_SET_ANDROID_WORKSPACE=0
ENV TF_NEED_TENSORRT=0
ENV TF_NEED_ROCM=0
RUN echo "" | TF_NEED_CUDA=0 ./configure
RUN bazel build --workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" --config=noaws --config=nogcp --config=nohdfs --config=nonccl --config=monolithic -c opt --copt=-O3 --copt="-D_GLIBCXX_USE_CXX11_ABI=0" --copt=-fvisibility=hidden //native_client:libstt.so

FROM base as pybase
RUN mkdir -p /STT/tensorflow/bazel-bin/native_client
COPY --from=tfbuild /STT/tensorflow/bazel-bin/native_client/libstt.so /STT/tensorflow/bazel-bin/native_client/libstt.so
WORKDIR /STT/native_client/python
RUN apt-get update && apt-get install -y --no-install-recommends wget && rm -rf /var/lib/apt/lists/*

FROM pybase as py36
ENV VIRTUAL_ENV=/tmp/cp36-cp36m-venv
RUN /opt/python/cp36-cp36m/bin/python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN pip install -U pip
RUN pip install numpy==1.7.0
ENV NUMPY_DEP_VERSION=">=1.7.0"
RUN make bindings TFDIR=/STT/tensorflow

FROM scratch as py36-artifact
COPY --from=py36 /STT/native_client/python/dist/STT-*-cp36* /

$ docker images
REPOSITORY                           TAG             IMAGE ID       CREATED          SIZE
linux-py-wheels                      latest          258d85e4a936   5 minutes ago    10.3MB
linux-py-wheels                      py36-artifact   258d85e4a936   5 minutes ago    10.3MB
linux-py-wheels                      py36            d2413e500df9   5 minutes ago    1.76GB
linux-py-wheels                      pybase          b31d4f23682f   8 minutes ago    1.68GB
linux-py-wheels                      tfbuild         16120c3975b7   31 minutes ago   2.42GB
linux-py-wheels                      base            dba6ce2faceb   54 minutes ago   1.64GB

@reuben
Contributor

reuben commented Mar 10, 2021

Looks like generically we can use a sequence somewhat like this:

docker login
docker pull docker.pkg.github.com/.../image:stage1 || true
docker pull docker.pkg.github.com/.../image:stage2 || true
...
docker pull docker.pkg.github.com/.../image:stageN || true

docker build -t image:stage1 --target stage1 --cache-from=docker.pkg.github.com/.../image:stage1 .
docker build -t image:stage2 --target stage2 --cache-from=docker.pkg.github.com/.../image:stage1 --cache-from=docker.pkg.github.com/.../image:stage2 .
...
docker build -t image:stageN --target stageN --cache-from=docker.pkg.github.com/.../image:stage1 --cache-from=docker.pkg.github.com/.../image:stage2 ... --cache-from=docker.pkg.github.com/.../image:stageN --output artifacts .
docker tag && docker push
# upload artifacts from artifacts/ folder

Which should cache all the intermediate stages and allow for easy sharing between workflows/jobs as well.
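
As a rough sketch, the sequence above could be generated for an arbitrary list of stages like this. The helper name `cache_commands` and the registry name in the usage note are hypothetical; the sketch assumes the `--output` flag of `docker build`, which needs BuildKit, and it only builds command strings without running anything.

```python
def cache_commands(registry_image, stages, out_dir="artifacts"):
    """Build the pull/build command list for a multi-stage cached build.

    registry_image is the remote image name without a tag; stages is the
    ordered list of Dockerfile stage names.
    """
    # Pull every previously pushed stage; "|| true" tolerates a cold cache.
    cmds = [f"docker pull {registry_image}:{s} || true" for s in stages]
    for i, stage in enumerate(stages):
        # Each stage may reuse layers from itself and all earlier stages.
        cache = " ".join(
            f"--cache-from={registry_image}:{s}" for s in stages[: i + 1]
        )
        cmd = f"docker build -t image:{stage} --target {stage} {cache}"
        if i == len(stages) - 1:
            # Only the final stage exports files to the local output folder.
            cmd += f" --output {out_dir}"
        cmds.append(cmd + " .")
    return cmds
```

For example, `cache_commands("registry.example/owner/image", ["base", "tfbuild", "pybase", "py36"])` would yield four pull commands followed by four build commands, with the tag and push steps left out.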

@reuben
Contributor

reuben commented Mar 11, 2021

I'm having a hard time getting both the remote and the local cache to work when building multiple targets of the same Dockerfile in a row to be able to tag all the intermediate images. But I found this approach which should let me avoid building more than once: https://forums.docker.com/t/tag-intermediate-build-stages-multi-stage-build/34795

@lissyx
Collaborator Author

lissyx commented Mar 12, 2021

So far I have been able to start getting a GitHub Actions workflow "working" for macOS build process:

  • a build job building tensorflow, exposing as a home.tar.xz artifact as on TC
  • a build job building our lib and native client binary, re-using the previous artifacts

I could get this to work end-to-end with the artifact serving as a cache and being properly re-populated / used as expected:

Since yesterday, I have been trying to get a full-blown TensorFlow build working. While the build of TensorFlow itself completed successfully several times (~3h of build, better than expected), there were issues related to artifact handling that basically break the caching I have put in place (the artifact is reported missing even though I can see it in the UI).

Also, this is starting to get a bit messy in the YAML file, but maybe we can refine the workflow into several smaller pieces and rely on https://docs.github.com/en/actions/reference/events-that-trigger-workflows#workflow_run. However, I still lack a proper understanding of "Note: This event will only trigger a workflow run if the workflow file is on the default branch." It sounds like the required YAML infrastructure would have to be on the default branch of the repo to run, which would not be the branch you work on ...

@reuben
Contributor

reuben commented Mar 12, 2021

Also, this is starting to get a bit messy in the YAML file, but maybe we can refine the workflow into several smaller pieces and rely on https://docs.github.com/en/actions/reference/events-that-trigger-workflows#workflow_run. However, I still lack a proper understanding of "Note: This event will only trigger a workflow run if the workflow file is on the default branch." It sounds like the required YAML infrastructure would have to be on the default branch of the repo to run, which would not be the branch you work on ...

Yes. The need to have task definitions already merged before you trigger them makes using workflow_run a no-go, I would say. The development process would be super messy and pollute the commit log.

@lissyx
Collaborator Author

lissyx commented Mar 12, 2021

I guess the only viable alternative would be https://docs.github.com/en/rest/reference/actions#create-a-workflow-dispatch-event

And somehow, we will end up re-creating tc-decision.py with less flexibility :). But first, I need to sort out this mess of tensorflow artifact, it worked well on the tflite prebuild only, and it's a mess since I do full tensorflow + tflite ; there's no error, the upload seems to succeed, it's weird.

@reuben
Contributor

reuben commented Mar 12, 2021

I hope by viable you're joking hehe. That's just as bad as workflow_run :P

@lissyx
Collaborator Author

lissyx commented Mar 12, 2021

I hope by viable you're joking hehe. That's just as bad as workflow_run :P

No, I was actually serious, there's no mention of the limitations of workflow_run for this one, but you have to play with the API to trigger the workflow.
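
For reference, "playing with the API" would look roughly like the sketch below, which only prepares the HTTP request for the "create a workflow dispatch event" endpoint and does not send it. The owner, repo, workflow file and input names in the usage note are placeholders, the target workflow is assumed to declare a `workflow_dispatch` trigger, and the function name is made up for this example.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def dispatch_request(owner, repo, workflow, ref, token, inputs=None):
    """Prepare (but do not send) a workflow_dispatch API request."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/actions/workflows/{workflow}/dispatches"
    body = {"ref": ref}
    if inputs:
        # Optional key/value inputs declared by the target workflow.
        body["inputs"] = inputs
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Sending it would be a matter of passing the result to `urllib.request.urlopen`, for example with `dispatch_request("example-org", "example-repo", "build.yml", "my-branch", token)`.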

@reuben
Contributor

reuben commented Mar 12, 2021

I think re-creating tc-decision is switching TC for a bad TC imitation. We should try to stick to the happy path in GitHub Actions as much as possible. Dealing with repetitive YAML files is way better than having to understand a custom dispatch solution.

@lissyx
Collaborator Author

lissyx commented Mar 12, 2021

I think re-creating tc-decision is switching TC for a bad TC imitation. We should try to stick to the happy path in GitHub Actions as much as possible. Dealing with repetitive YAML files is way better than having to understand a custom dispatch solution.

Yes, that was implied in my comment: better to have something smaller in scope and/or repetitive but ownable than to recreate perfection that relies on us and that people would end up rewriting anyway to own it.

@lissyx
Collaborator Author

lissyx commented Mar 12, 2021

So:

  • the process on macOS kinda works, but there is still a weird issue of an artifact being uploaded but not found until a new workflow is run
  • tflite seems to build and package, and ./deepspeech --version and ./deepspeech --help work
  • tf seems to build and package, but ./deepspeech --version and ./deepspeech --help don't work; libdeepspeech is somehow corrupted

Weirdly, it's only corrupted in native_client.tar.xz artifact, not within libdeepspeech.zip one ...

@lissyx
Collaborator Author

lissyx commented Mar 15, 2021

So:

* the process on macOS kinda works, but still having a weird issue of artifact uploaded but not found until running a new workflow

* tflite seems to build, package, and `./deepspeech --version` and `./deepspeech --help` works

* tf seems to build, package, but `./deepspeech --version` and `./deepspeech --help` dont works, `libdeepspeech` is somehow corrupted

Weirdly, it's only corrupted in native_client.tar.xz artifact, not within libdeepspeech.zip one ...

I could get something green, using gnutar instead of the default bsdtar. So there are just oddities about the interaction between artifacts and workflow on GitHub Actions now.
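
A cheap guard against this kind of silent packaging corruption is a checksum round-trip: pack, unpack, and compare digests before uploading. The sketch below uses Python's `tarfile` module rather than gnutar or bsdtar, so it illustrates the check itself and not the exact tools involved here; the file name in the usage part is a dummy.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256(path):
    """Return the hex SHA-256 digest of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def roundtrip_ok(source, workdir):
    """Pack one file into a .tar.xz, unpack it, and compare checksums."""
    source = Path(source)
    workdir = Path(workdir)
    archive = workdir / "native_client.tar.xz"
    with tarfile.open(archive, "w:xz") as tar:
        tar.add(source, arcname=source.name)
    out = workdir / "extracted"
    out.mkdir()
    with tarfile.open(archive, "r:xz") as tar:
        tar.extractall(out)
    return sha256(source) == sha256(out / source.name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lib = Path(tmp) / "libdeepspeech.so"
        lib.write_bytes(b"dummy contents for the example")
        print("round-trip intact:", roundtrip_ok(lib, tmp))
```

In a real job the same comparison could be done on the archive produced by the actual packaging script, right before the upload step.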

@lissyx
Collaborator Author

lissyx commented Mar 15, 2021

* Still need to implement a solution for macOS workers (or do we? If we go ahead with #3550 is the TFLite build fast enough to just do it every time?)

We can get caching via GitHub Actions artifacts. A full-blown TensorFlow build seems to consistently take ~3h on their hardware (a good surprise, I was expecting much worse), and a TFLite-only build is ~10m. The current door-to-door workflow is ~15m when re-using the cache of the full-blown TensorFlow build; it would be ~25-30 min with no cache and only TFLite.

Currently, it requires a small piece of specific JS code for a specific GitHub Actions implementation, because the default handling of artifacts ties them too closely to the workflow. This needs to be reviewed to ensure it is an acceptable increase in the amount of code we maintain.
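
To give an idea of the kind of logic involved, here is a sketch (in Python rather than the JS mentioned above, purely for illustration) of picking the newest usable artifact out of an artifact listing returned by the GitHub REST API. It assumes the decoded JSON has an `artifacts` list whose entries carry `name`, `expired` and `created_at` fields; the function name is made up, and this is not the code from the PR.

```python
def latest_artifact(payload, name):
    """Return the newest non-expired artifact called `name`, or None.

    `payload` is the decoded JSON of an artifact listing response.
    """
    candidates = [
        a for a in payload.get("artifacts", [])
        if a["name"] == name and not a["expired"]
    ]
    # ISO 8601 UTC timestamps in the same format sort correctly as strings.
    return max(candidates, key=lambda a: a["created_at"], default=None)
```

A `None` result would be the signal to rebuild the artifact (for example the TensorFlow `home.tar.xz`) instead of downloading it.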

@lissyx
Collaborator Author

lissyx commented Mar 19, 2021

Heads up: I have opened a first PR for discussion, #3563.
