Add Sagemaker Runner script #156
Merged
57 commits
0e97b24
Add Sagemaker Runner script
EngHabu 6aa374f
fix split call
EngHabu d1112c3
Another split update
EngHabu 1c797f4
add training job name to the output location
EngHabu 3935d23
typo
EngHabu 915a9e4
Change separator
EngHabu 6cd96a6
fixes
EngHabu 816aa71
Merge branch 'master' of github.com:lyft/flytekit into custom-sagemaker
bnsblue bd003f0
update runner to strip flyte prefices and suffices
bnsblue 3eab42d
fixing runner script
bnsblue 132d8b0
make fmt
bnsblue 2a330aa
add logic to handle env vars
bnsblue 8bf3557
Update cmd and env var cmd line args
EngHabu 6f8f2f4
Merge parent
EngHabu fd924fa
maxsplit=1 for env var split
EngHabu 54c0c42
maxsplit=1 for env var split
EngHabu 05a55a1
Add custom job task
EngHabu 2b4b397
Add algorithm spec
EngHabu 5980fde
Fix quotes bug
EngHabu c5b9682
use chang-hong's version of runner script
bnsblue 80afc21
lint; dep; fix custom task wrapper
bnsblue e2ad7d8
typo
bnsblue 8b0f79b
fix sdk custom task
bnsblue 5071365
add correct task type for custom training
bnsblue cc667b4
tidy up runner script
EngHabu ea19ce5
Add example to runner script
EngHabu b8a0d1b
Update command to enforce ordering
EngHabu 672848d
add an option to disable statsd
bnsblue 7efb26a
add a dummy stats client
bnsblue 5f1cdcf
comment
bnsblue b2f8f7b
lint
bnsblue 47ea769
injecting FLYTE_STATSD_DISABLED=True from CustomTrainingJobTask
bnsblue bf9b5bb
reverting the injection
bnsblue f28bb03
add comments for the processing of __FLYTE_CMD_DUMMY_VALUE__
bnsblue ce1d8da
fix runner script dict traversal
bnsblue 71d74e3
fmt and lint
bnsblue 4ee927a
point dummy client to localhost
bnsblue 86d7cd8
fix unit test
bnsblue 6a557e7
Merge branch 'master' of github.com:lyft/flytekit into custom-sagemaker
bnsblue 9c0c240
fixing unit test
bnsblue 01f9ed3
fixing unit test
bnsblue b074a8a
Add typing hints
EngHabu 0ee41bf
Wip
EngHabu 9b165c5
Fix broken test
EngHabu 6992e86
lint
EngHabu 9aeef72
add default values for AlgorithmSpecification parameters to allow use…
bnsblue c864973
Merge branch 'custom-sagemaker' of github.com:lyft/flytekit into cust…
bnsblue c9ff0f4
fmt and lint
bnsblue 1b8048c
refactor runner script and add a unit test
bnsblue dfce02e
add test to travis
bnsblue 741837e
revert log level
bnsblue 48b228f
PR Comments
EngHabu 46fedea
Reformat
EngHabu 827e1e1
reformat imports
EngHabu a3b1914
reformat, again
EngHabu 8064e2e
reformat
EngHabu dd02dee
Merge branch 'master' into custom-sagemaker
wild-endeavor
@@ -2,4 +2,4 @@

import flytekit.plugins  # noqa: F401

-__version__ = "0.12.5"
+__version__ = "0.12.6"
2 changes: 0 additions & 2 deletions
...mmon/tasks/sagemaker/training_job_task.py → ...s/sagemaker/built_in_training_job_task.py
82 changes: 82 additions & 0 deletions
flytekit/common/tasks/sagemaker/custom_training_job_task.py
@@ -0,0 +1,82 @@
from google.protobuf.json_format import MessageToDict

from flytekit.common.constants import SdkTaskType
from flytekit.common.tasks import sdk_runnable as _sdk_runnable
from flytekit.models.sagemaker import training_job as _training_job_models


class CustomTrainingJobTask(_sdk_runnable.SdkRunnableTask):
    """
    CustomTrainingJobTask defines a Python task that can run on SageMaker in a bring-your-own-container setup.

    """

    def __init__(
        self,
        task_function,
        cache_version,
        retries,
        deprecated,
        storage_request,
        cpu_request,
        gpu_request,
        memory_request,
        storage_limit,
        cpu_limit,
        gpu_limit,
        memory_limit,
        cache,
        timeout,
        environment,
        algorithm_specification: _training_job_models.AlgorithmSpecification,
        training_job_resource_config: _training_job_models.TrainingJobResourceConfig,
    ):
        """
        :param task_function: Function containing user code. This will be executed via the SDK's engine.
        :param Text cache_version: string describing the version for task discovery purposes
        :param int retries: Number of retries to attempt
        :param Text deprecated:
        :param Text storage_request:
        :param Text cpu_request:
        :param Text gpu_request:
        :param Text memory_request:
        :param Text storage_limit:
        :param Text cpu_limit:
        :param Text gpu_limit:
        :param Text memory_limit:
        :param bool cache:
        :param datetime.timedelta timeout:
        :param dict[Text, Text] environment:
        :param _training_job_models.AlgorithmSpecification algorithm_specification:
        :param _training_job_models.TrainingJobResourceConfig training_job_resource_config:
        """

        # Use the training job model as a measure of type checking
        self._training_job_model = _training_job_models.TrainingJob(
            algorithm_specification=algorithm_specification, training_job_resource_config=training_job_resource_config
        )

        super().__init__(
            task_function=task_function,
            task_type=SdkTaskType.SAGEMAKER_CUSTOM_TRAINING_JOB_TASK,
            discovery_version=cache_version,
            retries=retries,
            interruptible=False,
            deprecated=deprecated,
            storage_request=storage_request,
            cpu_request=cpu_request,
            gpu_request=gpu_request,
            memory_request=memory_request,
            storage_limit=storage_limit,
            cpu_limit=cpu_limit,
            gpu_limit=gpu_limit,
            memory_limit=memory_limit,
            discoverable=cache,
            timeout=timeout,
            environment=environment,
            custom=MessageToDict(self._training_job_model.to_flyte_idl()),
        )

    @property
    def training_job_model(self) -> _training_job_models.TrainingJob:
        return self._training_job_model
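The constructor above builds a training job model first and then passes a dict form of it as the task's `custom` payload. The sketch below illustrates that general shape using only the standard library. Every class and field name in it (`AlgorithmSpec`, `ResourceConfig`, `TrainingJobSketch`, `input_mode`, `instance_type`, `instance_count`) is a hypothetical stand-in, not the flytekit model, and the explicit `isinstance` checks are an assumption about how such validation could work rather than a description of what flytekit does.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AlgorithmSpec:
    # Hypothetical stand-in for an algorithm specification.
    input_mode: str = "FILE"


@dataclass(frozen=True)
class ResourceConfig:
    # Hypothetical stand-in for a training job resource config.
    instance_type: str
    instance_count: int = 1


@dataclass(frozen=True)
class TrainingJobSketch:
    algorithm_specification: AlgorithmSpec
    training_job_resource_config: ResourceConfig

    def __post_init__(self):
        # Fail at construction time so a bad argument surfaces when the
        # task is defined, not when it is serialized or executed.
        if not isinstance(self.algorithm_specification, AlgorithmSpec):
            raise TypeError("algorithm_specification must be an AlgorithmSpec")
        if not isinstance(self.training_job_resource_config, ResourceConfig):
            raise TypeError("training_job_resource_config must be a ResourceConfig")

    def to_custom(self):
        # Plain nested dict, playing the role of the `custom` payload.
        return asdict(self)


job = TrainingJobSketch(AlgorithmSpec(), ResourceConfig("ml.m4.xlarge"))
custom = job.to_custom()
```

Building the model before calling the parent constructor keeps a single object as the source of both the validation and the serialized payload.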
Empty file.
@@ -0,0 +1,146 @@
import datetime as _datetime
import typing

from flytekit.common.tasks.sagemaker.custom_training_job_task import CustomTrainingJobTask
from flytekit.models.sagemaker import training_job as _training_job_models


def custom_training_job_task(
    _task_function=None,
    algorithm_specification: _training_job_models.AlgorithmSpecification = None,
    training_job_resource_config: _training_job_models.TrainingJobResourceConfig = None,
    cache_version: str = "",
    retries: int = 0,
    deprecated: str = "",
    storage_request: str = None,
    cpu_request: str = None,
    gpu_request: str = None,
    memory_request: str = None,
    storage_limit: str = None,
    cpu_limit: str = None,
    gpu_limit: str = None,
    memory_limit: str = None,
    cache: bool = False,
    timeout: _datetime.timedelta = None,
    environment: typing.Dict[str, str] = None,
    cls: typing.Type = None,
):
    """
    Decorator to create a Custom Training Job definition. This task will run as a single unit of work on the platform.

    .. code-block:: python

        @inputs(int_list=[Types.Integer])
        @outputs(sum_of_list=Types.Integer)
        @custom_training_job_task
        def my_task(wf_params, int_list, sum_of_list):
            sum_of_list.set(sum(int_list))

    :param _task_function: this is the decorated method and shouldn't be declared explicitly. The function must
        take a first argument, and then named arguments matching those defined in @inputs and @outputs. No keyword
        arguments are allowed for wrapped task functions.

    :param _training_job_models.AlgorithmSpecification algorithm_specification: This represents the algorithm specification

    :param _training_job_models.TrainingJobResourceConfig training_job_resource_config: This represents the training job config.

    :param Text cache_version: [optional] string representing logical version for discovery. This field should be
        updated whenever the underlying algorithm changes.

        .. note::

            This argument is required to be a non-empty string if `cache` is True.

    :param int retries: [optional] integer determining number of times task can be retried on
        :py:exc:`flytekit.sdk.exceptions.RecoverableException` or transient platform failures. Defaults
        to 0.

        .. note::

            If retries > 0, the task must be able to recover from any remote state created within the user code. It is
            strongly recommended that tasks are written to be idempotent.

    :param Text deprecated: [optional] string that should be provided if this task is deprecated. The string
        will be logged as a warning so it should contain information regarding how to update to a newer task.

    :param Text storage_request: [optional] Kubernetes resource string for lower-bound of disk storage space
        for the task to run. Default is set by platform-level configuration.

        .. note::

            This is currently not supported by the platform.

    :param Text cpu_request: [optional] Kubernetes resource string for lower-bound of cores for the task to execute.
        This can be set to a fractional portion of a CPU. Default is set by platform-level configuration.

        TODO: Add links to resource string documentation for Kubernetes

    :param Text gpu_request: [optional] Kubernetes resource string for lower-bound of desired GPUs.
        Default is set by platform-level configuration.

        TODO: Add links to resource string documentation for Kubernetes

    :param Text memory_request: [optional] Kubernetes resource string for lower-bound of physical memory
        necessary for the task to execute. Default is set by platform-level configuration.

        TODO: Add links to resource string documentation for Kubernetes

    :param Text storage_limit: [optional] Kubernetes resource string for upper-bound of disk storage space
        for the task to run. This amount is not guaranteed! If not specified, it is set equal to storage_request.

        .. note::

            This is currently not supported by the platform.

    :param Text cpu_limit: [optional] Kubernetes resource string for upper-bound of cores for the task to execute.
        This can be set to a fractional portion of a CPU. This amount is not guaranteed! If not specified,
        it is set equal to cpu_request.

    :param Text gpu_limit: [optional] Kubernetes resource string for upper-bound of desired GPUs. This amount is not
        guaranteed! If not specified, it is set equal to gpu_request.

    :param Text memory_limit: [optional] Kubernetes resource string for upper-bound of physical memory
        necessary for the task to execute. This amount is not guaranteed! If not specified, it is set equal to
        memory_request.

    :param bool cache: [optional] boolean describing if the outputs of this task should be cached and
        re-usable.

    :param datetime.timedelta timeout: [optional] describes how long the task should be allowed to
        run at max before triggering a retry (if retries are enabled). By default, tasks are allowed to run
        indefinitely. If a null timedelta is passed (i.e. timedelta(seconds=0)), the task will not time out.

    :param dict[Text,Text] environment: [optional] environment variables to set when executing this task.

    :param cls: This can be used to override the task implementation with a user-defined extension. The class
        provided must be a subclass of flytekit.common.tasks.sdk_runnable.SdkRunnableTask. A user can use this to
        inject bespoke logic into the base Flyte programming model.

    :rtype: flytekit.common.tasks.sagemaker.custom_training_job_task.CustomTrainingJobTask
    """

    def wrapper(fn):
        return (cls or CustomTrainingJobTask)(
            task_function=fn,
            cache_version=cache_version,
            retries=retries,
            deprecated=deprecated,
            storage_request=storage_request,
            cpu_request=cpu_request,
            gpu_request=gpu_request,
            memory_request=memory_request,
            storage_limit=storage_limit,
            cpu_limit=cpu_limit,
            gpu_limit=gpu_limit,
            memory_limit=memory_limit,
            cache=cache,
            timeout=timeout or _datetime.timedelta(seconds=0),
            environment=environment,
            algorithm_specification=algorithm_specification,
            training_job_resource_config=training_job_resource_config,
        )

    if _task_function:
        return wrapper(_task_function)
    else:
        return wrapper
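The final `if _task_function: ... else: ...` branch is what lets the decorator be applied either bare or with keyword arguments. Below is a minimal, standalone sketch of that pattern; `tagged` and `label` are hypothetical names chosen for illustration and are not part of flytekit. The sketch tests `is not None` rather than truthiness, which is a small deviation from the code above.

```python
import functools


def tagged(_fn=None, *, label="default"):
    """Decorator usable both bare (@tagged) and with arguments (@tagged(label=...))."""

    def wrapper(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            # Pair the configured label with the wrapped function's result.
            return label, fn(*args, **kwargs)

        return inner

    # Bare use: Python passes the decorated function as the first argument.
    if _fn is not None:
        return wrapper(_fn)
    # Parameterized use: return the real decorator for Python to apply next.
    return wrapper


@tagged
def bare():
    return 1


@tagged(label="gpu")
def configured():
    return 2
```

Making the options keyword-only (the `*` in the signature) avoids confusing a positional option with the decorated function, which is why the leading parameter "shouldn't be declared explicitly" by callers.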
Can you explain why this is necessary? Were you seeing an error in sagemaker? If so, what sets it? Should we set it by default for certain types of jobs?
SageMaker currently supports neither sidecar containers nor custom AMIs, so statsD is not an option. It's a flag the plugin will set (not flytekit), so the plugin controls its value: if we deploy custom AMIs that have a statsD relay on localhost, the plugin can override the config. Generally speaking, I don't think flytekit should make assumptions about the execution environment.
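For readers following this thread, here is a rough sketch of the idea under discussion: an environment flag decides whether metrics go to a real client or to a no-op one. Only the variable name `FLYTE_STATSD_DISABLED` comes from the commit messages in this PR; the accepted values, `DummyStatsClient`, `RecordingStatsClient`, and `get_stats_client` are assumptions made for illustration and do not reproduce the flytekit implementation.

```python
import os


class DummyStatsClient:
    """Hypothetical no-op client: accepts any metric call and discards it."""

    def __getattr__(self, name):
        def noop(*args, **kwargs):
            return None

        return noop


class RecordingStatsClient:
    """Hypothetical stand-in for a real statsd client; it only records calls."""

    def __init__(self):
        self.calls = []

    def incr(self, metric, count=1):
        self.calls.append((metric, count))


def get_stats_client(environ=None):
    # The execution environment (for example, a plugin) sets the flag;
    # the SDK only reads it and makes no other assumption.
    environ = os.environ if environ is None else environ
    disabled = environ.get("FLYTE_STATSD_DISABLED", "").lower() in ("1", "true")
    return DummyStatsClient() if disabled else RecordingStatsClient()


client = get_stats_client({"FLYTE_STATSD_DISABLED": "True"})
client.incr("training_job.started")  # silently dropped
```

Reading the flag rather than hard-coding it in the task class matches the point made in the reply: whoever owns the execution environment decides whether a statsD relay is reachable.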