-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Sagemaker Runner script #156
Conversation
scripts/flytekit_sagemaker_runner.py
Outdated
import subprocess | ||
|
||
parser = argparse.ArgumentParser(description="Running sagemaker task") | ||
parser.add_argument('--__FLYTE_SAGEMAKER_CMD__', dest='flyte_sagmaker_cmd', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this will work. Are there limits on the size of the argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kumare3
I don't see Sagemaker's CRD posing any restrictions on that front: https://github.com/aws/amazon-sagemaker-operator-for-k8s/blob/master/release/rolebased/installer.yaml#L1658
The operating system could pose a limit on the length ARG_MAX
on a command line when the command is evaluated by the exec
function, but that's usually in tens to hundreds of thousands of byte/characters
https://stackoverflow.com/a/19355351
https://pubs.opengroup.org/onlinepubs/009695399/basedefs/limits.h.html
but I don't know if there's any implicit/intrinsic limitation posed by SageMaker backend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, there is
status:
additional: "Unable to create training job: Unable to create Training Job: ValidationException:
1 validation error detected: Value '{__FLYTE_SAGEMAKER_CMD__=service_venv+pyflyte-execute+--task-module+workflows.custom_sagemaker_training+--task-name+custom_training_task+--inputs+s3://<prefix>/custom/data/inputs.pb+--output-prefix+s3://<prefix>+--my_input+hello
world}' at 'hyperParameters' failed to satisfy constraint: Map value must satisfy
constraint: [Member must have length less than or equal to 256, Member must have
length greater than or equal to 0, Member must satisfy regular expression pattern:
.*]\n\tstatus code: 400, request id: ff068295-a7ce-4fcf-8453-053a3fa9bc31"
self, | ||
max_number_of_training_jobs: int, | ||
max_parallel_training_jobs: int, | ||
training_job: typing.Union[SdkBuiltinAlgorithmTrainingJobTask, CustomTrainingJobTask], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really smart! Could you please add a small unit test for that if you haven’t done that already?
…rs to not specify the fields
scripts/flytekit_sagemaker_runner.py
Outdated
# --__FLYTE_ENV_VAR_env1__ val1 --__FLYTE_ENV_VAR_env2__ val2 | ||
# --__FLYTE_CMD_0_service_venv__ __FLYTE_CMD_DUMMY_VALUE__ | ||
# --__FLYTE_CMD_1_pyflyte-execute__ __FLYTE_CMD_DUMMY_VALUE__ | ||
# --__FLYTE_CMD_2_--task-module__ __FLYTE_CMD_DUMMY_VALUE__ | ||
# --__FLYTE_CMD_3_blah__ __FLYTE_CMD_DUMMY_VALUE__ | ||
# --__FLYTE_CMD_4_--task-name__ __FLYTE_CMD_DUMMY_VALUE__ | ||
# --__FLYTE_CMD_5_bloh__ __FLYTE_CMD_DUMMY_VALUE__ | ||
# --__FLYTE_CMD_6_--output-prefix__ __FLYTE_CMD_DUMMY_VALUE__ | ||
# --__FLYTE_CMD_7_s3://fake-bucket__ __FLYTE_CMD_DUMMY_VALUE__ | ||
# --__FLYTE_CMD_8_--inputs__ __FLYTE_CMD_DUMMY_VALUE__ | ||
# --__FLYTE_CMD_9_s3://fake-bucket__ __FLYTE_CMD_DUMMY_VALUE__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example concludes the format @EngHabu and I agreed on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -4,3 +4,4 @@ | |||
|
|||
HOST = _common_config.FlyteStringConfigurationEntry("statsd", "host", default="localhost") | |||
PORT = _common_config.FlyteIntegerConfigurationEntry("statsd", "port", default=8125) | |||
DISABLED = _common_config.FlyteBoolConfigurationEntry("statsd", "disabled", default=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why this is necessary? Were you seeing an error in sagemaker? If so, what sets it? Should we set it by default for certain types of jobs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SageMaker currently doesn't support running side-car containers nor custom AMIs.. so statsD is not an option. It's a flag the plugin will set (not flytekit) so it can control what to set it to (if we deploy custom AMIs that has statsD relay on localhost, the plugin can override the config) I don't think flytekit should make assumptions about the execution environment... generally speaking
This PR aims to adapt Flyte containers to the SageMaker execution environment. ### For details about how SageMaker runs a custom container, refer to flyteorg/flyte#454. In short, the contract that SageMaker follows is to this command to run the container: ```shell docker run <image> train ``` where `train` is a python executable that will be put into `/usr/local/bin` inside the container when you install the `sagemaker-training` library, which is just a very simple python file that calls into the `sagemaker_training` library, the content of which is shown as follows: ```python #!/usr/bin/python3 # -*- coding: utf-8 -*- import re import sys from sagemaker_training.cli.train import main if __name__ == '__main__': sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0]) sys.exit(main()) ``` and this script will eventually call into the script you specified with `SAGEMAKER_PROGRAM` in the Dockerfile. ### Why do we need the runner script Two reasons: 1. This contract command cannot be changed. If we want to use our virtual environment `service_venv`, we need to have an intermediate layer to kick that off 2. The only way to pass non-blob-type inputs to our container is to use the hyperparameter field of the CRD. With that, the inputs will be passed in as command-line arguments. That means we have to pass in the container's `args` and `env_vars` via command line. To pass in ``` `env1=val1 env2=val2 service_venv pyflyte-execute --task-module blah --task-name bloh --output-prefix s3://fake-bucket --inputs s3://fake-bucket` ``` Our flyteplugin need to rewrite it into and our runner script needs to parse it from the following format: ``` --__FLYTE_ENV_VAR_env1__ val1 --__FLYTE_ENV_VAR_env2__ val2 --__FLYTE_CMD_0_service_venv__ __FLYTE_CMD_DUMMY_VALUE__ --__FLYTE_CMD_1_pyflyte-execute__ __FLYTE_CMD_DUMMY_VALUE__ --__FLYTE_CMD_2_--task-module__ __FLYTE_CMD_DUMMY_VALUE__ --__FLYTE_CMD_3_blah__ __FLYTE_CMD_DUMMY_VALUE__ --__FLYTE_CMD_4_--task-name__ __FLYTE_CMD_DUMMY_VALUE__ --__FLYTE_CMD_5_bloh__ __FLYTE_CMD_DUMMY_VALUE__ --__FLYTE_CMD_6_--output-prefix__ __FLYTE_CMD_DUMMY_VALUE__ --__FLYTE_CMD_7_s3://fake-bucket__ __FLYTE_CMD_DUMMY_VALUE__ --__FLYTE_CMD_8_--inputs__ __FLYTE_CMD_DUMMY_VALUE__ --__FLYTE_CMD_9_s3://fake-bucket__ __FLYTE_CMD_DUMMY_VALUE__ ``` We added the prefix and suffix in order to prepare for the future hyperparameter jobs. ### Notable observations and changes: 1. SageMaker will parse everything the user specifies in the hyperParameters field of a TrainingJob CRD or the staticHyperparameters field of a HyperparameterTuningJob CRD into a JSON object. So we can create hyperparameter names and values freely as long as they comply with JSON format. 2. If a hyperparameter has an empty value `""`, it will be ignored by SageMaker. 3. SageMaker currently doesn't support running side-car containers nor custom AMIs, so statsD is not an option. We add a flag the plugin will set (not flytekit) so it can control what to set it to (if we deploy custom AMIs that has statsD relay on localhost, the plugin can override the config)
TL;DR
This PR aims to adapt Flyte containers to the SageMaker execution environment.
For details about how SageMaker runs a custom container, refer to flyteorg/flyte#454.
In short, the contract that SageMaker follows is to this command to run the container:
where
train
is a python executable that will be put into/usr/local/bin
inside the container when you install thesagemaker-training
library, which is just a very simple python file that calls into thesagemaker_training
library, the content of which is shown as follows:and this script will eventually call into the script you specified with
SAGEMAKER_PROGRAM
in the Dockerfile.Why do we need the runner script
Two reasons:
service_venv
, we need to have an intermediate layer to kick that offargs
andenv_vars
via command line.To pass in
Our flyteplugin need to rewrite it into and our runner script needs to parse it from the following format:
We added the prefix and suffix in order to prepare for the future hyperparameter jobs.
Notable observations and changes:
""
, it will be ignored by SageMaker.Type
Are all requirements met?
Tracking Issue
flyteorg/flyte#453