This repository has been archived by the owner on Mar 12, 2024. It is now read-only.

create_merge_pr_job fails transiently #687

Closed
1 task done
timmc-edx opened this issue Aug 2, 2023 · 7 comments · Fixed by #691

@timmc-edx
Contributor

timmc-edx commented Aug 2, 2023

AC

  • Fix flakiness of create_merge_pr_job failure

Implementation Details:

  • Add logs and learn more.
  • Question: are there patterns for retrying in tubular, and can we just implement that?

Between July 26 and Aug 2 2023, the create_merge_pr_job of the edxapp_private_public_merge_sync GoCD pipeline has failed with the same unclear error at least 5 times. On re-run, it has passed.

We will likely need to add debug logging, especially of the API response, stack traces, and git state. It might also work to just add retries, if we end up not making progress on this and just want to put a band-aid on it.

Notes

The script backing this job is create_private_to_public_pr.py in the tubular repo.

This job is supposed to convey any merges in edx-platform-private into the public repo. (The following jobs then convey public changes into the private repo.) However, no such merges have happened for quite some time, as we are following a new process that involves GitHub Security Advisories instead. One interesting bit of timing, though: A private PR was closed (not merged) on July 25. It is unclear why this would have any effect, though.

Here's an example of a failing run:

INFO:tubular.scripts.create_private_to_public_pr:Cloning private repo git@github.com:edx/edx-platform-private.git with branch security-release.
INFO:tubular.scripts.create_private_to_public_pr:Pushing private branch security-release to public repo git@github.com:openedx/edx-platform.git as branch private_to_public_b97007e.
INFO:tubular.scripts.create_private_to_public_pr:No pull request created for merging private_to_public_b97007e into master in 'git@github.com:openedx/edx-platform.git' repo - nothing to merge: {'message': 'Validation Failed', 'errors': [{'resource': 'PullRequest', 'field': 'head', 'code': 'invalid'}], 'documentation_url': 'https://docs.github.com/rest/pulls/pulls#create-a-pull-request'}

The traceback includes chained exceptions, with the last one being caused by the delete_branch call: github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/git/refs#get-all-references-in-a-namespace"}

Here's an example of a subsequent passing run, showing a different error that nonetheless does not cause a job failure:

INFO:tubular.scripts.create_private_to_public_pr:No pull request created for merging private_to_public_b97007e into master in 'git@github.com:openedx/edx-platform.git' repo - nothing to merge: {'message': 'Validation Failed', 'errors': [{'resource': 'PullRequest', 'code': 'custom', 'message': 'No commits between master and private_to_public_b97007e'}], 'documentation_url': 'https://docs.github.com/rest/pulls/pulls#create-a-pull-request'}

This may indicate some kind of race condition with GitHub and branch creation.
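If the cause really is a delay between pushing the branch and GitHub's API seeing it, one possible mitigation is to poll until the branch is visible before trying to create the PR. The following is only a sketch of that idea, not what tubular does: `wait_for` and `branch_is_visible` are hypothetical names, and the fake check stands in for a real API lookup.

```python
import time


def wait_for(condition, timeout_seconds=30.0, interval_seconds=2.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll condition() until it returns True or the timeout elapses.

    Returns True if the condition became true, False on timeout.
    The condition is always checked at least once.
    """
    deadline = clock() + timeout_seconds
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)


# Usage with a fake check that only succeeds on the third call, standing in
# for "does the GitHub API see the newly pushed branch yet?" (hypothetical).
checks = {"count": 0}


def branch_is_visible():
    checks["count"] += 1
    return checks["count"] >= 3


became_visible = wait_for(branch_is_visible, timeout_seconds=5, interval_seconds=0)
```

Returning `False` on timeout, rather than raising, leaves it to the caller to decide whether a branch that never shows up should fail the job.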

@timmc-edx timmc-edx converted this from a draft issue Aug 2, 2023
@jmbowman jmbowman moved this to Prioritized in Arch-BOM Aug 9, 2023
@robrap
Contributor

robrap commented Aug 15, 2023

Noting how often this has been happening:

  • July: 4x
  • August: 4x (first two weeks)

@robrap robrap moved this from Prioritized to On-Call in Arch-BOM Aug 17, 2023
@robrap robrap added the pipeline-failure Related to deployment pipeline failures. label Aug 21, 2023
@timmc-edx timmc-edx self-assigned this Aug 21, 2023
@timmc-edx timmc-edx moved this from On-Call to In Progress in Arch-BOM Aug 21, 2023
@timmc-edx
Contributor Author

timmc-edx commented Aug 21, 2023

More comprehensive occurrence information, from looking at the job history in GoCD:

  • April 2023: 1x
  • May 2023: 2x
  • June 2023: 2x
  • July 2023: 5x
  • August 2023: 5x (as of Aug 21; last failure 852246d on Aug 16)

timmc-edx added a commit that referenced this issue Aug 21, 2023
Try to fix #687 by allowing
retries on `delete_branch`. This may prevent the issue we've seen in
`create_private_to_public_pr.py` where the deletion of a newly created
branch occasionally fails with a 404.
@timmc-edx
Contributor Author

timmc-edx commented Aug 21, 2023

This might fix it, and if it doesn't then we should at least get more information: #691 (allows retries on deleting a branch when we get a 404 for that)
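For readers without access to #691, here is a rough sketch of what "retry on 404" can look like. It is not the code from that PR: `NotFoundError` is a hypothetical stand-in for a 404 exception such as the `UnknownObjectException` in the traceback above, and `retry_on_not_found` and `flaky_delete_branch` are made-up names.

```python
import time


class NotFoundError(Exception):
    """Hypothetical stand-in for a 404 error raised by a GitHub client."""


def retry_on_not_found(func, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call func(), retrying when it raises NotFoundError.

    Re-raises the last NotFoundError if every attempt fails. Any other
    exception propagates immediately without a retry.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except NotFoundError:
            if attempt == attempts:
                raise
            sleep(delay_seconds)


# Usage with a fake delete_branch that 404s twice before succeeding.
calls = {"count": 0}


def flaky_delete_branch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise NotFoundError("404 Not Found")
    return "deleted"


result = retry_on_not_found(
    flaky_delete_branch, attempts=3, delay_seconds=0, sleep=lambda seconds: None
)
```

Retrying only on the not-found case keeps other failures (auth, rate limits) loud, which matters here since part of the goal is to get more information if the fix doesn't work.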

@timmc-edx
Contributor Author

If we see another failure, we should also check whether the branch was successfully pushed to GitHub before we re-run the job.
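One way to do that check is `git ls-remote --heads <remote> <branch>`, which prints one `<sha><TAB><ref>` line per matching ref. A small sketch for scripting it follows; the function names are hypothetical, the SHAs in the sample are made up, and `remote_has_branch` needs `git` and network access, so it is defined but not called here.

```python
import subprocess


def branch_in_ls_remote_output(output, branch):
    """Return True if `git ls-remote` output lists refs/heads/<branch> exactly."""
    wanted = f"refs/heads/{branch}"
    for line in output.splitlines():
        parts = line.split("\t", 1)
        if len(parts) == 2 and parts[1].strip() == wanted:
            return True
    return False


def remote_has_branch(remote_url, branch):
    """Ask the remote whether the branch exists (requires git and network)."""
    completed = subprocess.run(
        ["git", "ls-remote", "--heads", remote_url, branch],
        check=True, capture_output=True, text=True,
    )
    return branch_in_ls_remote_output(completed.stdout, branch)


# Sample output with made-up SHAs, in the tab-separated ls-remote format.
sample_output = (
    "0000000000000000000000000000000000000001\trefs/heads/master\n"
    "0000000000000000000000000000000000000002\trefs/heads/private_to_public_b97007e\n"
)
pushed = branch_in_ls_remote_output(sample_output, "private_to_public_b97007e")
missing = branch_in_ls_remote_output(sample_output, "private_to_public_abc1234")
```

The exact match on `refs/heads/<branch>` is deliberate: `ls-remote` patterns also match refs that merely end in the given name, such as a branch nested under another prefix.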

@github-project-automation github-project-automation bot moved this from In Progress to Done in Arch-BOM Aug 21, 2023
@timmc-edx timmc-edx reopened this Aug 22, 2023
@timmc-edx
Contributor Author

Closing on the presumption that it worked; can reopen if that's not the case.

@robrap
Contributor

robrap commented Aug 22, 2023

@timmc-edx: Related comments regarding our private runbook for GoCD:

  1. The runbook points to a view that filters out Done items. This makes me wonder if we might want two searches, including one where you could review Done items as well in case you might want to reopen one. Thoughts?
  2. The runbook has a comment for you related to documenting adding retries in Tubular. Do you mind reviewing/resolving?
    Thank you.

@timmc-edx
Contributor Author

Updated runbook. Not sure what we want to do re: status:Done, but at least it's in a comment there now so that the next person consulting the runbook will consider that.
