Add Sagemaker Runner script #156

Merged
merged 57 commits into from
Sep 2, 2020

Conversation

EngHabu
Collaborator

@EngHabu EngHabu commented Aug 13, 2020

TL;DR

This PR aims to adapt Flyte containers to the SageMaker execution environment.

For details about how SageMaker runs a custom container, refer to flyteorg/flyte#454.

In short, SageMaker's contract is to run the container with this command:

docker run <image> train

Here, train is a Python executable that is placed in /usr/local/bin inside the container when you install the sagemaker-training library. It is a very simple Python file that calls into the sagemaker_training library; its content is as follows:

#!/usr/bin/python3

# -*- coding: utf-8 -*-
import re
import sys

from sagemaker_training.cli.train import main

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

This script eventually calls the script you specified with SAGEMAKER_PROGRAM in the Dockerfile.

Why do we need the runner script

Two reasons:

  1. This contract command cannot be changed. If we want to use our virtual environment service_venv, we need to have an intermediate layer to kick that off
  2. The only way to pass non-blob-type inputs to our container is to use the hyperparameter field of the CRD. With that, the inputs will be passed in as command-line arguments. That means we have to pass in the container's args and env_vars via command line.
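As a rough illustration of the "intermediate layer" idea, here is a minimal sketch of a helper that launches a child command with extra environment variables layered on top of the current environment. The helper name run_with_env and the child command are hypothetical stand-ins for illustration, not the code in this PR; in the real setup the child would be something like service_venv pyflyte-execute ....

```python
import os
import subprocess
import sys


def run_with_env(cmd, env_vars):
    """Run cmd with env_vars layered on top of the current environment.

    Hypothetical helper: it only illustrates the intermediate layer described
    above and is not the runner script from this PR.
    """
    env = dict(os.environ)
    env.update(env_vars)
    # check=False: return the child's exit code to the caller unchanged.
    return subprocess.run(cmd, env=env, check=False)


# Stand-in child process that just echoes one of the injected variables.
child = [sys.executable, "-c", "import os; print(os.environ['env1'])"]
result = run_with_env(child, {"env1": "val1"})
print("child exit code:", result.returncode)
```

Propagating the child's return code lets the caller report a failed task as a failed training job instead of swallowing the error.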

To pass in the following command:

`env1=val1 env2=val2 service_venv pyflyte-execute --task-module blah --task-name bloh --output-prefix s3://fake-bucket --inputs s3://fake-bucket`

our Flyte plugin needs to rewrite it into the format below, and our runner script needs to parse it back from that format:

 --__FLYTE_ENV_VAR_env1__ val1 --__FLYTE_ENV_VAR_env2__ val2
 --__FLYTE_CMD_0_service_venv__ __FLYTE_CMD_DUMMY_VALUE__
 --__FLYTE_CMD_1_pyflyte-execute__ __FLYTE_CMD_DUMMY_VALUE__
 --__FLYTE_CMD_2_--task-module__ __FLYTE_CMD_DUMMY_VALUE__
 --__FLYTE_CMD_3_blah__ __FLYTE_CMD_DUMMY_VALUE__
 --__FLYTE_CMD_4_--task-name__ __FLYTE_CMD_DUMMY_VALUE__
 --__FLYTE_CMD_5_bloh__ __FLYTE_CMD_DUMMY_VALUE__
 --__FLYTE_CMD_6_--output-prefix__ __FLYTE_CMD_DUMMY_VALUE__
 --__FLYTE_CMD_7_s3://fake-bucket__ __FLYTE_CMD_DUMMY_VALUE__
 --__FLYTE_CMD_8_--inputs__ __FLYTE_CMD_DUMMY_VALUE__
 --__FLYTE_CMD_9_s3://fake-bucket__ __FLYTE_CMD_DUMMY_VALUE__

We added the prefix and suffix in order to prepare for the future hyperparameter jobs.
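The round trip described above can be sketched as follows. This is a minimal illustration that assumes each hyperparameter reaches the container as a `--name value` pair, as in the example. The function names encode, to_argv and decode are made up for this sketch; the actual plugin and runner code may differ.

```python
import re

ENV_PREFIX = "__FLYTE_ENV_VAR_"
CMD_PREFIX = "__FLYTE_CMD_"
SUFFIX = "__"
DUMMY = "__FLYTE_CMD_DUMMY_VALUE__"

_CMD_RE = re.compile(r"^--__FLYTE_CMD_(\d+)_(.*)__$", re.DOTALL)
_ENV_RE = re.compile(r"^--__FLYTE_ENV_VAR_(.*)__$", re.DOTALL)


def encode(cmd, env_vars):
    """Plugin side (hypothetical): build the hyperparameter map."""
    params = {f"{ENV_PREFIX}{k}{SUFFIX}": v for k, v in env_vars.items()}
    for i, token in enumerate(cmd):
        # The index keeps the order and makes repeated tokens distinct keys.
        params[f"{CMD_PREFIX}{i}_{token}{SUFFIX}"] = DUMMY
    return params


def to_argv(params):
    """Flatten the map into the '--name value' list the runner would see."""
    argv = []
    for name, value in params.items():
        argv += [f"--{name}", value]
    return argv


def decode(argv):
    """Runner side (hypothetical): recover (cmd, env_vars) from argv."""
    env_vars, indexed = {}, []
    for name, value in zip(argv[0::2], argv[1::2]):
        cmd_match = _CMD_RE.match(name)
        if cmd_match:
            indexed.append((int(cmd_match.group(1)), cmd_match.group(2)))
            continue
        env_match = _ENV_RE.match(name)
        if env_match:
            env_vars[env_match.group(1)] = value
    # Sort numerically so that index 10 comes after index 9.
    return [token for _, token in sorted(indexed)], env_vars


cmd = ["service_venv", "pyflyte-execute", "--task-module", "blah"]
argv = to_argv(encode(cmd, {"env1": "val1"}))
print(argv)
print(decode(argv))
```

Putting each command token in the hyperparameter name, with a non-empty placeholder as the value, means no token depends on SageMaker keeping an empty value, and sorting by the embedded index means the runner does not depend on the order in which the arguments arrive.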

Notable observations and changes:

  1. SageMaker will parse everything the user specifies in the hyperParameters field of a TrainingJob CRD or the staticHyperparameters field of a HyperparameterTuningJob CRD into a JSON object. So we can create hyperparameter names and values freely as long as they comply with JSON format.
  2. If a hyperparameter has an empty value "", it will be ignored by SageMaker.
  3. SageMaker currently supports neither sidecar containers nor custom AMIs, so statsD is not an option. We add a flag to disable statsD that the plugin (not flytekit) sets, so the plugin controls its value; if we later deploy custom AMIs that have a statsD relay on localhost, the plugin can override the config.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Tracking Issue

flyteorg/flyte#453

@EngHabu EngHabu marked this pull request as draft August 13, 2020 20:09
import subprocess

parser = argparse.ArgumentParser(description="Running sagemaker task")
parser.add_argument('--__FLYTE_SAGEMAKER_CMD__', dest='flyte_sagmaker_cmd',
Contributor

Are you sure this will work? Are there limits on the size of the argument?

Contributor

@kumare3
I don't see Sagemaker's CRD posing any restrictions on that front: https://github.com/aws/amazon-sagemaker-operator-for-k8s/blob/master/release/rolebased/installer.yaml#L1658

The operating system could impose a limit (ARG_MAX) on the length of a command line when the command is evaluated by the exec function, but that limit is usually in the tens to hundreds of thousands of bytes/characters
https://stackoverflow.com/a/19355351
https://pubs.opengroup.org/onlinepubs/009695399/basedefs/limits.h.html

but I don't know if there's any implicit/intrinsic limitation imposed by the SageMaker backend
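For reference, the OS-side limit can be checked with a short sketch like the one below. It uses os.sysconf("SC_ARG_MAX"), which is available on POSIX systems. The helper names are made up for this sketch, and the size estimate is only approximate, since some systems also count per-string pointer overhead.

```python
import os


def arg_max_bytes():
    """Return the OS limit on exec() argument size, or None if unavailable."""
    try:
        return os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on non-POSIX platforms; the name may be unknown.
        return None


def approx_exec_size(argv, env):
    """Rough byte count of argv plus environment, including NUL terminators."""
    size = sum(len(os.fsencode(arg)) + 1 for arg in argv)
    # Each environment entry is stored as "key=value\0".
    size += sum(len(os.fsencode(k)) + len(os.fsencode(v)) + 2 for k, v in env.items())
    return size


argv = ["train", "--__FLYTE_SAGEMAKER_CMD__", "service_venv pyflyte-execute"]
print("ARG_MAX:", arg_max_bytes())
print("approximate size:", approx_exec_size(argv, dict(os.environ)))
```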

Contributor

Unfortunately, there is

status:
additional: "Unable to create training job: Unable to create Training Job: ValidationException:
1 validation error detected: Value '{__FLYTE_SAGEMAKER_CMD__=service_venv+pyflyte-execute+--task-module+workflows.custom_sagemaker_training+--task-name+custom_training_task+--inputs+s3://<prefix>/custom/data/inputs.pb+--output-prefix+s3://<prefix>+--my_input+hello
world}' at 'hyperParameters' failed to satisfy constraint: Map value must satisfy
constraint: [Member must have length less than or equal to 256, Member must have
length greater than or equal to 0, Member must satisfy regular expression pattern:
.*]\n\tstatus code: 400, request id: ff068295-a7ce-4fcf-8453-053a3fa9bc31"
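A check like the following sketch could catch this before the job is submitted. The limit of 256 is taken from the ValidationException above and applies to hyperparameter values. The error says nothing about a limit on names, so the sketch does not check them, and the helper name oversized_values is hypothetical.

```python
MAX_VALUE_LEN = 256  # from the ValidationException message above


def oversized_values(hyperparameters, limit=MAX_VALUE_LEN):
    """Return the names of hyperparameters whose value exceeds the limit."""
    return [name for name, value in hyperparameters.items() if len(value) > limit]


# One long value carrying the whole command: rejected once it passes 256 characters.
single = {"__FLYTE_SAGEMAKER_CMD__": "+".join(["pyflyte-execute"] * 20)}
print(oversized_values(single))

# One short placeholder value per command token stays under the limit.
per_token = {f"__FLYTE_CMD_{i}_tok__": "__FLYTE_CMD_DUMMY_VALUE__" for i in range(20)}
print(oversized_values(per_token))
```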

self,
max_number_of_training_jobs: int,
max_parallel_training_jobs: int,
training_job: typing.Union[SdkBuiltinAlgorithmTrainingJobTask, CustomTrainingJobTask],
Contributor

This is really smart! Could you please add a small unit test for that if you haven’t done that already?

@bnsblue bnsblue requested a review from kumare3 September 2, 2020 02:44
Comment on lines 15 to 25
# --__FLYTE_ENV_VAR_env1__ val1 --__FLYTE_ENV_VAR_env2__ val2
# --__FLYTE_CMD_0_service_venv__ __FLYTE_CMD_DUMMY_VALUE__
# --__FLYTE_CMD_1_pyflyte-execute__ __FLYTE_CMD_DUMMY_VALUE__
# --__FLYTE_CMD_2_--task-module__ __FLYTE_CMD_DUMMY_VALUE__
# --__FLYTE_CMD_3_blah__ __FLYTE_CMD_DUMMY_VALUE__
# --__FLYTE_CMD_4_--task-name__ __FLYTE_CMD_DUMMY_VALUE__
# --__FLYTE_CMD_5_bloh__ __FLYTE_CMD_DUMMY_VALUE__
# --__FLYTE_CMD_6_--output-prefix__ __FLYTE_CMD_DUMMY_VALUE__
# --__FLYTE_CMD_7_s3://fake-bucket__ __FLYTE_CMD_DUMMY_VALUE__
# --__FLYTE_CMD_8_--inputs__ __FLYTE_CMD_DUMMY_VALUE__
# --__FLYTE_CMD_9_s3://fake-bucket__ __FLYTE_CMD_DUMMY_VALUE__
Contributor

This example summarizes the format that @EngHabu and I agreed on.

bnsblue
bnsblue previously approved these changes Sep 2, 2020
Contributor

@bnsblue bnsblue left a comment

LGTM

@@ -4,3 +4,4 @@

HOST = _common_config.FlyteStringConfigurationEntry("statsd", "host", default="localhost")
PORT = _common_config.FlyteIntegerConfigurationEntry("statsd", "port", default=8125)
DISABLED = _common_config.FlyteBoolConfigurationEntry("statsd", "disabled", default=False)
Contributor

Can you explain why this is necessary? Were you seeing an error in sagemaker? If so, what sets it? Should we set it by default for certain types of jobs?

Collaborator Author

SageMaker currently supports neither sidecar containers nor custom AMIs, so statsD is not an option. It's a flag the plugin will set (not flytekit), so the plugin can control its value (if we deploy custom AMIs that have a statsD relay on localhost, the plugin can override the config). Generally speaking, I don't think flytekit should make assumptions about the execution environment.

@EngHabu EngHabu merged commit 2bd8b17 into master Sep 2, 2020
@bnsblue bnsblue mentioned this pull request Nov 22, 2020
8 tasks
max-hoffman pushed a commit to dolthub/flytekit that referenced this pull request May 11, 2021