
Cleaner kubectl port-forward retry logic #2593

Conversation

@balopat (Contributor) commented Aug 1, 2019

This cleans up the port forwarding logic at the kubectlForwarder level.
I got rid of the two polling wait loops; instead there is a single loop that:

  • runs until the port forward entry gets cancelled by skaffold itself,
  • monitors the kubectl port-forward logs in a separate goroutine and kills the process on known errors (a newly introduced known error: error upgrading connection),
  • reruns kubectl port-forward when the process exits on failure (or is killed by the log-monitoring goroutine),
  • uses a more aggressive --pod-running-timeout of 1 second for a more responsive UX.
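
A minimal sketch of that single loop, in Go with illustrative names and structure rather than the actual kubectlForwarder code:

```go
package portforwardsketch

import (
	"bufio"
	"context"
	"fmt"
	"io"
	"os/exec"
	"strings"
	"time"
)

// forwardLoop is an illustrative sketch of the loop described above, not the
// skaffold implementation: it reruns kubectl port-forward until the entry's
// context is cancelled by skaffold itself.
func forwardLoop(ctx context.Context, namespace, resource, ports string) {
	for {
		select {
		case <-ctx.Done():
			return // cancelled by skaffold itself
		default:
		}

		cmd := exec.CommandContext(ctx, "kubectl", "port-forward",
			"--namespace", namespace,
			"--pod-running-timeout=1s", // more aggressive timeout for responsive UX
			resource, ports)

		// Pipe kubectl's output to a goroutine that kills the process as soon
		// as a known error string shows up in the logs.
		pr, pw := io.Pipe()
		cmd.Stdout = pw
		cmd.Stderr = pw
		go killOnKnownError(pr, cmd)

		if err := cmd.Start(); err == nil {
			cmd.Wait() // blocks for as long as kubectl port-forward is healthy
		}
		pw.Close() // stop the log-monitoring goroutine

		// The process exited on failure or was killed by the monitor:
		// back off briefly, then the outer loop retries.
		time.Sleep(500 * time.Millisecond)
	}
}

func killOnKnownError(r io.Reader, cmd *exec.Cmd) {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		s := scanner.Text()
		if strings.Contains(s, "error forwarding port") ||
			strings.Contains(s, "unable to forward") ||
			strings.Contains(s, "error upgrading connection") {
			fmt.Println("killing kubectl port-forward after error:", s)
			cmd.Process.Kill()
			return
		}
	}
}
```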

Also added logic to handle a port collision right before kicking off the port forwarding process. If a collision happens, no new port is brokered, in the hope that the user will kill the (assumed) external process. Once the colliding process dies, port forwarding resumes (with a friendly recovery message to the user).
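
A minimal sketch of that pre-flight collision check, assuming it only needs to know whether the local port is currently bindable (the helper name, messages, and polling interval are illustrative, not the PR's code):

```go
package portforwardsketch

import (
	"fmt"
	"net"
	"time"
)

// waitForPortToFree sketches the collision handling described above: if
// another process already holds the port, keep the originally brokered port
// and wait for the competing process to go away instead of switching ports.
func waitForPortToFree(port int) {
	notified := false
	for {
		ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
		if err == nil {
			ln.Close()
			if notified {
				fmt.Printf("port %d is free again, resuming port forwarding\n", port)
			}
			return
		}
		if !notified {
			fmt.Printf("port %d is taken by another process, waiting for it to exit...\n", port)
			notified = true
		}
		time.Sleep(1 * time.Second)
	}
}
```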

Also, we found a bug: the portForwarder map's LoadOrStore function was not storing anything. This is now fixed and covered by a copy-paste of the unit test that was testing the port brokering logic on sync.Map; the copy-paste should go away when we implement #2503.
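
For context, a tiny illustration of the sync.Map.LoadOrStore contract that such a unit test exercises (this is not the skaffold test itself, and the key/value pairs are made up):

```go
package portforwardsketch

import "sync"

// loadOrStoreContract shows the expected behavior: the first call for a key
// must actually store the value, so a second call reports loaded == true.
func loadOrStoreContract() (firstLoaded, secondLoaded bool) {
	var m sync.Map
	_, firstLoaded = m.LoadOrStore(8080, "pod/foo")  // stores; firstLoaded == false
	_, secondLoaded = m.LoadOrStore(8080, "pod/bar") // key exists; secondLoaded == true
	return firstLoaded, secondLoaded
}
```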

Fixed the retry logic for the TestDevPortForwardGKELoadBalancer integration test, and fixed the lingering port-forwarding processes by sending SIGTERM to the skaffold dev process instead of just killing it.
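
A minimal sketch of that test cleanup change, assuming the test holds an exec.Cmd for the skaffold dev process (the helper name is hypothetical):

```go
package portforwardsketch

import (
	"os/exec"
	"syscall"
)

// stopSkaffoldDev asks `skaffold dev` to shut down gracefully so it can tear
// down its kubectl port-forward children, instead of killing it outright.
func stopSkaffoldDev(cmd *exec.Cmd) error {
	return cmd.Process.Signal(syscall.SIGTERM)
}
```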

@codecov codecov bot commented Aug 1, 2019

Codecov Report

Merging #2593 into release/v0.35.0 will decrease coverage by 0.05%.
The diff coverage is 32.07%.

| Impacted Files | Coverage Δ |
|---|---|
| ...ffold/kubernetes/portforward/port_forward_entry.go | 100% <ø> (ø) ⬆️ |
| ...kubernetes/portforward/port_forward_integration.go | 0% <0%> (ø) ⬆️ |
| ...affold/kubernetes/portforward/kubectl_forwarder.go | 0% <0%> (ø) ⬆️ |
| pkg/skaffold/util/port.go | 86.66% <100%> (+0.95%) ⬆️ |
| ...g/skaffold/kubernetes/portforward/pod_forwarder.go | 86.95% <100%> (ø) ⬆️ |
| pkg/skaffold/runner/portforwarder.go | 100% <100%> (ø) ⬆️ |
| ...ffold/kubernetes/portforward/resource_forwarder.go | 56% <100%> (+4.97%) ⬆️ |
| ...g/skaffold/kubernetes/portforward/entry_manager.go | 93.47% <94.11%> (+6.24%) ⬆️ |

@balopat (Contributor Author) commented Aug 1, 2019

I am still missing unit tests, and there is a scenario that is not checked now (not sure if it was covered before): kubectl port-forward does not fail if the port is already bound :(
I managed to create a little adversarial scenario where a competing process takes the port and kubectl port-forward just keeps running happily in the background without throwing any errors.


```go
if strings.Contains(s, "error forwarding port") ||
	strings.Contains(s, "unable to forward") ||
	strings.Contains(s, "error upgrading connection") {
```

Member

Would it make sense to log a warning or something so the user knows the retry is happening and why?

Contributor Author

it is there in trace mode

Contributor Author

I keep thinking about what you mentioned: "error upgrading connection" is from port forwarding in client-go: https://github.com/kubernetes/client-go/blob/master/tools/portforward/portforward.go#L194 - I think it is fine to retry on it...

Member

I'm pretty sure Cloud Code doesn't show trace — it would be good for users to at least be able to discover that there was a failure.

I think restarting on error makes a lot of sense: it's essentially turning kubectl port-forward into a single-use attempt. Perhaps this code could look for the more general error header (portforward.go:.*error occurred) and log the error, but suppress it when it matches one of these known-issue strings?
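
A minimal sketch of that suggestion, assuming logrus-style leveled logging (names and levels are illustrative, not the PR's code):

```go
package portforwardsketch

import (
	"regexp"
	"strings"

	"github.com/sirupsen/logrus"
)

// Match the generic client-go error header, warn the user about it, and keep
// the known retryable strings at trace level since the loop already retries.
var genericPortForwardError = regexp.MustCompile(`portforward\.go:.*error occurred`)

var knownRetryableErrors = []string{
	"error forwarding port",
	"unable to forward",
	"error upgrading connection",
}

func logPortForwardLine(line string) {
	for _, known := range knownRetryableErrors {
		if strings.Contains(line, known) {
			logrus.Tracef("retrying kubectl port-forward: %s", line)
			return
		}
	}
	if genericPortForwardError.MatchString(line) {
		logrus.Warnf("kubectl port-forward error: %s", line)
	}
}
```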


```go
func (k *KubectlForwarder) forward(parentCtx context.Context, pfe *portForwardEntry) {
	var notifiedUser bool
	for {
```

Contributor

when reading this, it looks like this forward runs forever?

Contributor Author

it does, until the pfe gets cancelled (there are 3 return statements in the body, all around cancellation!)

Contributor

I got lost in reading and saw only 1 return statement.
I still feel that all 3 return statements are waiting for conditions that are:

  1. pfe.terminated - not sure when that happens
  2. & 3. the user enters ^C in skaffold dev.

Will it keep trying forever until the above two conditions are satisfied?

Contributor Author

Well, if things are going well, it is going to wait in cmd.Wait() for as long as kubectl port-forward is running...

Contributor Author

but yes, it does retry unless it is an explicit cancel from the skaffold process...

Contributor Author

Something like this:

[image: Port Forward State Model diagram]

Contributor Author

btw this was generated on planttext.com :)

```plantuml
@startuml

title Port Forward State Model
[*] --> CheckForTermination
CheckForTermination -down-> Cancelled
CheckForTermination -down-> TryCmdStart
TryCmdStart -down-> Cancelled
TryCmdStart --> CheckForTermination
TryCmdStart -down-> LogMonitoring

Cancelled -down-> [*]
state "kubectl port-forward logs are monitored" as LogMonitoring {
  TryCmdWait -down-> Cancelled
  TryCmdWait -up-> CheckForTermination
}

@enduml
```

Contributor Author

I wish I could put labels on the arrows, but the key thing is: non-cancellation error scenarios go back to the start, cancellation exits, and no error continues to the next step.

@balopat balopat changed the title experimenting with cleaner port forward retry logic Cleaner port forward retry logic Aug 1, 2019
@balopat balopat marked this pull request as ready for review August 1, 2019 20:38
@balopat balopat changed the title Cleaner port forward retry logic Cleaner kubectl port-forward retry logic Aug 1, 2019
balopat and others added 3 commits August 1, 2019 15:41
Now that we clean up "skaffold dev", TestEvent was failing.
The first run of this test, `TestEvent/v1/event_log`, was building images, and at
the end of the dev loop prune was cleaning up all of the containers.

In the second test, `TestEvent/v1/events`, the images were being built,
but the push failed with this error:

```
The push refers to repository [gcr.io/k8s-skaffold/test-dev]
time="2019-08-01T18:32:12-07:00" level=fatal msg="build failed: building [gcr.io/k8s-skaffold/test-dev]: build artifact: tag does not exist: gcr.io/k8s-skaffold/test-dev:v0.34.1-64-g1d5d3b68-dirty"
```

I think this is because docker was pointing to layers it thought it
had already built, but which actually didn't exist anymore.

This test is fixed by adding `--no-prune=true`.
@priyawadhwa (Contributor) commented:

cc @balopat looks like there was a prune error now that we are cleaning up on TestEvent -- logs here.

I think the issue is that we first run TestEvent/v1/event_log which builds image with digest 77ae371cff1b. This is then cleaned up by prune. We then try to run TestEvent/v1/events, which builds the same image 77ae371cff1b, claims it's in the cache, but then actually fails to push the image because it doesn't exist.

I added --no-prune=true to get around this.

@balopat balopat requested a review from loosebazooka as a code owner August 2, 2019 04:38
@priyawadhwa priyawadhwa force-pushed the cleaner_port_forward_retry branch from f3cc147 to f2defe8 Compare August 2, 2019 04:39
@balopat balopat changed the base branch from master to release/v0.35.0 August 2, 2019 15:50
@balopat (Contributor Author) commented Aug 2, 2019

Merging this to release branch.

@balopat balopat merged commit 4b959ca into GoogleContainerTools:release/v0.35.0 Aug 2, 2019