Pipeline nightly build is broken #2738

Closed
afrittoli opened this issue Jun 3, 2020 · 16 comments

@afrittoli (Member)

Expected Behavior

Pipeline nightly build works

Actual Behavior

Pipeline nightly build is broken.
Building the base image fails with:

fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
v3.12.0-30-g01407813ee [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-29-gb310a5f576 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]
OK: 12726 distinct packages available
(1/1) Upgrading alpine-baselayout (3.2.0-r6 -> 3.2.0-r7)
Executing alpine-baselayout-3.2.0-r7.pre-upgrade
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..data': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/token': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/namespace': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/namespace': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/ca.crt': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/token': Read-only file system
Executing alpine-baselayout-3.2.0-r7.post-upgrade
ERROR: alpine-baselayout-3.2.0-r7: failed to rename var/.apk.f752bb51c942c7b3b4e0cf24875e21be9cdcd4595d8db384 to var/run.
Executing busybox-1.31.1-r16.trigger
1 error; 27 MiB in 25 packages
error building image: error building stage: failed to execute command: waiting for process to exit: exit status 1

Steps to Reproduce the Problem

  1. https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/pipeline-release-nightly-tqgdd

Additional Info

The image is based on alpine.
It used to be latest, and it is now pinned to 3.12, which is the version used in the last working run: https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/pipeline-release-nightly-w5xcr
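
For reference, the pin amounts to something like this at the top of the base-image Dockerfile (a sketch, assuming the usual FROM layout; the actual file lives under images/ in the repo):

FROM alpine:3.12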

The only visible difference in the run log is the following. In the successful run:

fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
v3.12.0-3-gc43b21255b [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-1-g9465f17ea9 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]

while in the failing run:

fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
v3.12.0-30-g01407813ee [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-29-gb310a5f576 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]
@afrittoli (Member, Author)

The v3.12.0 tag (https://gitlab.alpinelinux.org/alpine/aports/-/tags/v3.12.0) was released 4 days ago, which is when the last successful nightly happened.
The build on the day before ran on v3.11.0.
We had one successful build on v3.12.0 and then it started failing.

@vdemeester (Member)

Read-only file system makes me think it's either a node or an image problem 🤔

@dibyom (Member) commented Jun 4, 2020

I can reproduce this on my cluster running Tekton v0.12.0

  1. Apply Task YAML: https://gist.github.com/dibyom/038c9ae01fff69606976971cdb6c4102
  2. Create svc account/secret: https://github.com/tektoncd/pipeline/tree/master/tekton#service-account-and-secrets
  3. Run Task:

tkn task start \
  --param=imageRegistry=${IMAGE_REGISTRY} \
  --serviceaccount=release-right-meow \
  --inputresource=source=tekton-pipelines-git \
  --outputresource=builtBaseImage=base-image \
  publish-tekton-pipelines

@bobcatfish (Collaborator) commented Jun 4, 2020

OK: 12726 distinct packages available
(1/1) Upgrading alpine-baselayout (3.2.0-r6 -> 3.2.0-r7)
Executing alpine-baselayout-3.2.0-r7.pre-upgrade
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..data': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/token': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/namespace': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/namespace': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/ca.crt': Read-only file system
rm: can't remove '/var/run/secrets/kubernetes.io/serviceaccount/..2020_06_03_02_13_55.407209058/token': Read-only file system

It looks like something is trying to remove a mounted secret:

      volumeMounts:
...
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-g2t44
        readOnly: true
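
A quick way to confirm that from inside a step container (a sketch; assumes the default service account token mount path shown above):

mount | grep serviceaccount
# expect something like: tmpfs on /var/run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,relatime)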

@bobcatfish (Collaborator)

btw looks like this might be a duplicate of #2726; looks like pinning didn't fix it :S

@bobcatfish (Collaborator)

Here's a log from a recent successful run, where this "pre-upgrade" thing doesn't seem to be getting involved:

fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/11) Installing ca-certificates (20191127-r2)
(2/11) Installing nghttp2-libs (1.40.0-r0)
(3/11) Installing libcurl (7.69.1-r0)
(4/11) Installing expat (2.2.9-r1)
(5/11) Installing pcre2 (10.35-r0)
(6/11) Installing git (2.26.2-r0)
(7/11) Installing openssh-keygen (8.3_p1-r0)
(8/11) Installing ncurses-terminfo-base (6.2_p20200523-r0)
(9/11) Installing ncurses-libs (6.2_p20200523-r0)
(10/11) Installing libedit (20191231.3.1-r0)
(11/11) Installing openssh-client (8.3_p1-r0)
Executing busybox-1.31.1-r16.trigger
Executing ca-certificates-20191127-r2.trigger
OK: 27 MiB in 25 packages
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
v3.12.0-3-gc43b21255b [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-1-g9465f17ea9 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]
OK: 12725 distinct packages available
OK: 27 MiB in 25 packages

It's interesting that the successful log references these versions:

v3.12.0-3-gc43b21255b [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-1-g9465f17ea9 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]

In the failed log we have these versions:

v3.12.0-30-g01407813ee [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-29-gb310a5f576 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]

@bobcatfish (Collaborator)

It seems like the error might be coming from:

(1/1) Upgrading alpine-baselayout (3.2.0-r6 -> 3.2.0-r7)
Executing alpine-baselayout-3.2.0-r7.pre-upgrade

maybe that's not actually something we want to upgrade?

we're trying to upgrade all packages:

RUN apk add --update git openssh-client \
&& apk update \
&& apk upgrade
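
If the blanket upgrade turns out not to be needed, one possible shape for this (a sketch, not what the repo currently does) is to skip apk upgrade entirely and only install what the image actually uses:

RUN apk add --no-cache git openssh-client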

@bobcatfish (Collaborator)

https://git.alpinelinux.org/aports/tree/main/alpine-baselayout/alpine-baselayout.pre-upgrade

# migrate /var/run directory to /run
if [ -d /var/run ]; then
	cp -a /var/run/* /run 2>/dev/null
	rm -rf /var/run
	ln -s ../run /var/run
fi

wut

@bobcatfish (Collaborator)

I'm not sure what's going on but I recommend building from older checkouts of pipelines and seeing if this was caused by a change that was introduced in the pipelines repo; if so we can use a binary search to find the problem.
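
A sketch of that binary search with git bisect (the good/bad refs here are placeholders; the real ones would come from the nightly history):

git bisect start
git bisect bad HEAD
git bisect good <last-known-good-nightly-commit>
# rebuild the base image at each step and mark the commit good/bad until the culprit is found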

@bobcatfish bobcatfish added the kind/bug Categorizes issue or PR as related to a bug. label Jun 4, 2020
@bobcatfish (Collaborator)

Just realized that to make 0.13 I'll need to fix this - and I'm build cop tomorrow anyway, so no time like the present :D

@bobcatfish bobcatfish self-assigned this Jun 4, 2020
@bobcatfish (Collaborator)

Okay so I was able to run kaniko locally and reproduce this more or less by introducing the slightly contrived step of mounting a read-only file into /var/run:

docker run \
  -v `pwd`:/workspace/go/src/github.com/tektoncd/pipeline \
  -v `pwd`/SECRET.json:/var/run/secrets/SECRET.json:ro \
  -e GOOGLE_APPLICATION_CREDENTIALS=/workspace/go/src/github.com/tektoncd/pipeline/SECRET.json \
  gcr.io/kaniko-project/executor:v0.17.1 \
  --dockerfile=/workspace/go/src/github.com/tektoncd/pipeline/images/Dockerfile \
  --destination=gcr.io/christiewilson-catfactory/pipeline-release-test \
  --context=/workspace/go/src/github.com/tektoncd/pipeline

I got this error:

(1/2) Upgrading alpine-baselayout (3.2.0-r6 -> 3.2.0-r7)
Executing alpine-baselayout-3.2.0-r7.pre-upgrade
rm: can't remove '/var/run/secrets/SECRET.json': Resource busy

I then pinned to 3.11 and it built just fine.

It seems like pinning to 3.12 isn't working b/c 3.12 is a moving target; even since my comment above (#2738 (comment)) I'm seeing a different version being used when repro-ing:

v3.12.0-43-gfe7417f5c2 [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-44-g288d7f5e51 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]

I'm gonna pin to 3.11 and put a bit more time in to see if I can figure out why this has only started happening and if I should report it somewhere.
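
If the tag keeps moving, one stricter option would be pinning the base by digest (a sketch; the digest is a placeholder, not a real value). Note this only freezes the image layers themselves; apk update still fetches whatever the package repos serve at build time:

FROM alpine:3.11@sha256:<digest-of-a-known-good-image>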

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jun 4, 2020
We're not quite at the bottom of
tektoncd#2738 but it seems like
alpine 3.12 is having this problem and 3.11 is not. 3.12 seems to be
still being updated which may be why we are seeing different behavior
across runs using 3.12; for now pinning to 3.11 should be a way to be
able to build our releases while we try to understand what's going on.
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jun 4, 2020
@bobcatfish (Collaborator) commented Jun 4, 2020

I think this is a bizarre collision of kaniko behaviour and alpine relying on /var/run being a symlink to /run, so I opened GoogleContainerTools/kaniko#1297

I think our options are:

  1. keep the alpine image pinned (and hope this never starts being a problem for 3.11 - I still don't understand why a script committed in 2017 is only causing this problem now)
  2. fix the problem in kaniko
  3. build with something other than kaniko

tekton-robot pushed a commit that referenced this issue Jun 4, 2020
@joshsleeper

I think it's just a perfect storm of conditions that could've happened in any prior alpine release, but by chance didn't.

the base images for alpine 3.12 don't have the latest alpine-baselayout for their release yet, and so anything that's trying to build + upgrade from them with a read-only mount anywhere in /var/run/* (and I wager anywhere in /run/* too!) will throw its hands up.

as soon as the alpine base images include that package upgrade, this issue will mostly disappear until the next perfect storm. 😆
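
One way to watch for that (a sketch; assumes docker is available locally) is to check whether the published base image would still try to pull in the newer baselayout:

docker run --rm alpine:3.12 sh -c 'apk update >/dev/null && apk upgrade --simulate'
# shows something like "(1/1) Upgrading alpine-baselayout (3.2.0-r6 -> 3.2.0-r7)" while the base image lags behind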

@bobcatfish (Collaborator)

ahh makes sense @joshsleeper! thanks for explaining :D do you happen to know how one could track this kind of thing (e.g. are there release notes somewhere that mention this)? np if not, thanks anyway for the info


@joshsleeper commented Jun 5, 2020

for all the things people, including myself, love about Alpine Linux, I think it's a fairly small crew running that ship so their announcement processes aren't extensive. They have a "Latest Development" feed on their homepage that tracks package updates (which is really just a feed of commits), but I think that's about it?

https://www.alpinelinux.org/

I think it's mostly a side-effect of the fact that by design alpine doesn't maintain package history really, so generally the only correct version of alpine packages to be using is the latest. if there are upgrades, you're supposed to have them full stop.

when a major release is being cut (e.g. 3.11, 3.12, etc.) they pin to specific package versions (let's say something like python3.7, which might be 3.7.0 at time of release), but as bug and security fixes roll out they'll replace that 3.7.0 with 3.7.1 and then 3.7.2, at which point there is no longer a way to explicitly install 3.7.0 in that release of alpine.
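
In apk terms that plays out roughly like this (an illustration; the package name and versions are made up for the example):

apk add python3=3.7.0-r0   # exact pin: works only while the release repo still serves 3.7.0-r0
apk add python3            # unpinned: always resolves to whatever the repo serves today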

hopefully that's more helpful than man-splain-y. I've just had to dig into this before at my own company to understand why we had various odd issues with alpine that we never had with other distros.

It's worth noting that I think this probably isn't a kaniko issue really so much as it's primarily exposed by kaniko. k8s is what mounts those secrets there, and it just so happens that people aren't running apk upgrade in many contexts other than an image build!

I'm still undecided on if this should be fixed by alpine or k8s, but I'm guessing it'll end up being alpine since other distros aren't having similar issues... that I know of.

@bobcatfish (Collaborator)

I think we've successfully worked around this, and it seems like GoogleContainerTools/kaniko#1297 probably won't have a solution for a while. Considering this resolved!
