Improve error handling in ShellTask #1732

Merged
merged 3 commits into from
Jul 18, 2023
pradithya
Member

TL;DR

Improve the error returned by ShellTask. Currently, the ShellTask error message is unhelpful because it does not capture the subprocess's stderr. In addition, the error raised by ShellTask is classified as a system error rather than a user error.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added

Complete description

  • Raise FlyteRecoverableException from ShellTask so that the failure is classified as a user error and the retry policy specified in the task is respected.
  • Improve the error message captured by ShellTask. Currently, the error message that ShellTask produces in flyteconsole looks as follows. Note that it contains none of the failed command's output.
[3/3] currentAttempt done. Last Error: SYSTEM::Traceback (most recent call last):

      File "/usr/local/lib/python3.8/site-packages/flytekit/exceptions/scopes.py", line 165, in system_entry_point
        return wrapped(*args, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/flytekit/core/base_task.py", line 527, in dispatch_execute
        raise e
      File "/usr/local/lib/python3.8/site-packages/flytekit/core/base_task.py", line 524, in dispatch_execute
        native_outputs = self.execute(**native_inputs)
      File "/usr/local/lib/python3.8/site-packages/flytekit/extras/tasks/shell.py", line 220, in execute
        subprocess.check_call(gen_script, shell=True)
      File "/usr/local/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)

Message:

    Command '
        export RUN_DATE=2023-07-04 &&
        make run
    ' returned non-zero exit status 2.

SYSTEM ERROR! Contact platform administrators.

With this change, users can see the subprocess's stdout and stderr as part of the error message.
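The idea described above can be illustrated with a minimal sketch. It is not the code from this PR: `RecoverableShellError` is a hypothetical stand-in for flytekit's `FlyteRecoverableException`, and `run_shell` is an illustrative helper name. It only shows how capturing the subprocess output makes the raised error self-explanatory.

```python
import subprocess


class RecoverableShellError(Exception):
    """Hypothetical stand-in for flytekit's FlyteRecoverableException."""


def run_shell(script: str) -> str:
    """Run a script through the shell and return its stdout.

    On a non-zero exit code, raise an error whose message carries the
    captured stdout and stderr instead of only the exit status.
    """
    result = subprocess.run(script, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        message = (
            f"Script failed with return code {result.returncode}\n"
            f"StdOut: {result.stdout}\n"
            f"StdErr: {result.stderr}"
        )
        raise RecoverableShellError(message)
    return result.stdout
```

Compared with `subprocess.check_call`, whose `CalledProcessError` reports only the command and exit status, the message here includes whatever the script wrote before failing.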

Tracking Issue

flyteorg/flyte#3559

Follow-up issue

N.A.

Pradithya Aria added 2 commits July 11, 2023 23:58
Signed-off-by: Pradithya Aria <[email protected]>
Signed-off-by: Pradithya Aria <[email protected]>
Signed-off-by: Pradithya Aria <[email protected]>
raise
logger.error(error)
# raise FlyteRecoverableException so that it's classified as user error and will be retried
raise FlyteRecoverableException(error)
Member

I think not every user error should be a recoverable one 🤔

Take a look at how PythonFunctionTask separates user vs system error scopes:

return exception_scopes.user_entry_point(self._task_function)(**kwargs)
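As a rough illustration of the scoping idea, and not flytekit's actual `exception_scopes` implementation, the sketch below uses two hypothetical decorators. Code wrapped in `user_scope` has its failures tagged as user errors, and the outer `system_scope` tags everything else as a platform error. All names here are assumptions made for the example.

```python
import functools


class UserError(Exception):
    """Hypothetical: the failure originated in user-supplied code."""


class PlatformError(Exception):
    """Hypothetical: the failure originated in the framework itself."""


def user_scope(fn):
    """Tag any exception escaping fn as a user error."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            raise UserError(str(exc)) from exc
    return wrapper


def system_scope(fn):
    """Tag any not-yet-classified exception escaping fn as a platform error."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except UserError:
            raise  # already classified by the inner scope
        except Exception as exc:
            raise PlatformError(str(exc)) from exc
    return wrapper


def user_function(x: int) -> float:
    return 1 / x  # fails for x == 0


@system_scope
def dispatch(x: int) -> float:
    # Framework code runs in the system scope; only the call into the
    # user's function is wrapped in the user scope.
    return user_scope(user_function)(x)
```

In this model, an exception raised directly inside `dispatch` (outside the `user_scope` call) surfaces as a `PlatformError`, which mirrors why the `subprocess.check_call` failure in the traceback above was reported as a system error.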

Member Author

> I think not every user error should be a recoverable one

I agree. However, since this is a shell execution, I am not sure how we can differentiate recoverable from non-recoverable user errors. I am defaulting to a recoverable error so that users of ShellTask can still leverage the task retry feature.

Member

Right, of course, makes sense 👍
This change is not really backwards compatible though, I wonder whether this is a dealbreaker 🤔

Member Author

Yes, there is undoubtedly a behavior change when a shell task throws an error.

Before: An error in a shell task is considered a system error and is therefore retried N times, where N is 3 by default.

After: An error in a shell task is considered a recoverable user error. By default it is not retried; it is retried only if users specify retries in the task's metadata.

However, the shell task previously did not respect the retry strategy set by users, and I think that can be considered a bug.
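The before/after behavior can be modeled with a small simulation. This is a simplified sketch of the semantics described in this comment, not Flyte's scheduler: the exception classes, `run_with_retries`, and its parameters are all hypothetical names, and the system default of 3 is taken from the comment above.

```python
class RecoverableUserError(Exception):
    """Hypothetical: a user error that is eligible for retry."""


class PlatformError(Exception):
    """Hypothetical: a system error, retried on the platform's budget."""


SYSTEM_RETRIES_DEFAULT = 3  # assumed from the discussion above


def run_with_retries(task, user_retries=0, system_retries=SYSTEM_RETRIES_DEFAULT):
    """Call task() until it succeeds or the matching retry budget is spent.

    Returns (result, attempts). User errors draw on the task's own
    `user_retries` setting; system errors draw on `system_retries`.
    """
    user_failures = 0
    system_failures = 0
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except RecoverableUserError:
            user_failures += 1
            if user_failures > user_retries:
                raise
        except PlatformError:
            system_failures += 1
            if system_failures > system_retries:
                raise
```

Under this model, a failing shell command that raises a recoverable user error runs once when the task sets no retries, and up to `user_retries + 1` times otherwise, which is the behavior the task author asked for.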

Member Author

@wild-endeavor your opinion will be very helpful here.

Member

+1, I think shell task should throw user recoverable error

"""
process = subprocess.Popen(script, stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=0, shell=True, text=True)

# print stdout so that long-running subprocess will not appear unresponsive
Member

qq: Does the pod's log currently not show any stdout or stderr?

Member Author

Currently it does. However, this PR captures stdout, so it needs to be emitted manually.
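A minimal sketch of this capture-and-re-emit pattern follows. It is not the exact code merged in this PR: `run_and_stream` is a hypothetical helper name, and the background thread that drains stderr is an addition made here, on the assumption that reading only stdout in the loop could block if the child fills the stderr pipe.

```python
import subprocess
import sys
import threading
from typing import List, Tuple


def run_and_stream(script: str) -> Tuple[int, str, str]:
    """Run a script, echo its stdout live, and return (code, stdout, stderr)."""
    process = subprocess.Popen(
        script, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True, text=True
    )

    stderr_chunks: List[str] = []

    def drain_stderr() -> None:
        # Read stderr concurrently so the child never blocks on a full pipe.
        stderr_chunks.append(process.stderr.read())

    reader = threading.Thread(target=drain_stderr, daemon=True)
    reader.start()

    stdout_lines: List[str] = []
    for line in process.stdout:
        # Re-emit each line so a long-running script still shows up in the logs.
        sys.stdout.write(line)
        sys.stdout.flush()
        stdout_lines.append(line)

    returncode = process.wait()
    reader.join()
    return returncode, "".join(stdout_lines), "".join(stderr_chunks)
```

Because stdout is now piped rather than inherited, nothing would reach the pod's log without the explicit write inside the loop.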

@pingsutw pingsutw merged commit 604cabd into flyteorg:master Jul 18, 2023
pradithya pushed a commit to pradithya/flytekit that referenced this pull request Jul 19, 2023
* Improve error handling in ShellTask

Signed-off-by: Pradithya Aria <[email protected]>

* Add new line

Signed-off-by: Pradithya Aria <[email protected]>

* Capture stdout

Signed-off-by: Pradithya Aria <[email protected]>

---------

Signed-off-by: Pradithya Aria <[email protected]>
Co-authored-by: Pradithya Aria <[email protected]>
pingsutw pushed a commit that referenced this pull request Jul 21, 2023
fg91 pushed a commit that referenced this pull request Aug 15, 2023