Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolver pattern #404

Merged
merged 55 commits into from
Mar 11, 2021
Merged

Resolver pattern #404

merged 55 commits into from
Mar 11, 2021

Conversation

wild-endeavor
Copy link
Contributor

@wild-endeavor wild-endeavor commented Feb 26, 2021

TL;DR

This changes the execution side of flytekit. Previously tasks were loaded only with a module and key to look up through an importlib call. This introduces a "task resolver" construct which is responsible for two things:

  1. Given a task, when serializing be able to produce a series of strings that, at execution time,
  2. allow the resolver to reconstruct the task object.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

  • Please see the comments on the TaskResolverMixin.
  • When a task is serialized, the command that is generated is now the output of the task's resolver.
  • The entrypoint.py click command has been updated to be able to handle that command.

As part of this change, it was found that it was useful to be able to keep track of the variable that a task resolver was assigned to. Because of this, we've added back (duplicated rather) the instance-tracking metaclass mechanism from the old API (the stuff in registerable.py). See the new tests under the tracking/ folder for a non-Flyte-based example of how it works.

  • To simplify things for legacy API tasks, the entrypoint.py execute function has been separated out.
  • PythonAutoContainerTask has been moved into a separate file.
  • Because PythonAutoContainerTasks are now a TrackedInstance we no longer need the InstanceVar construct.

Tracking Issue

https://github.com/lyft/flyte/issues/

Follow-up issue

NA

kumare3 and others added 30 commits February 4, 2021 15:51
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
flytekit/bin/entrypoint.py Outdated Show resolved Hide resolved
flytekit/bin/entrypoint.py Outdated Show resolved Hide resolved
flytekit/bin/entrypoint.py Outdated Show resolved Hide resolved
@@ -158,11 +156,6 @@ def serialize_all(
)
old_style_entities.append(o)

# PythonInstanceTasks will not be picked up by the above, so we need to reiterate
for o, v in load_module_object_for_type(pkgs, PythonInstanceTask, additional_path=local_source_root).items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to double check, we don't need this for #391?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, getting rid of it, using the instance tracking metaclass now.

flytekit/core/python_auto_container.py Outdated Show resolved Hide resolved
@wild-endeavor
Copy link
Contributor Author

flyteorg/flyte#814 end to end tests.

@@ -5,6 +5,7 @@
import pathlib
import random as _random
import traceback as _traceback
from typing import List
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so hypothetically if we were to use the TaskResolver to execute fast-registration, then in that case, we probably need some way of cascading the resolver down the dynamic chain right?
I understand by definition at the moment TaskResolver is bound to a Task. but in the case of dynamic task, would it automatically cascade downstream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to think about this more. Can I do this as a separate PR? I'll write up an issue... also because fast-register doesn't work right now with dynamic tasks.

"--inputs",
"{{.input}}",
"--output-prefix",
"{{.outputPrefix}}",
"--raw-output-data-prefix",
"{{.rawOutputDataPrefix}}",
"--resolver",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this implies that the resolver has to be the last arg only?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use multi-valued args?
--resolver-arg "x=y" --resolver-arg "..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what you mean. the resolver can be specified anywhere, but the resolver args have to be at the end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @kumare3 means an alternative syntax is
--resolver ... --resolver-arg "task-module=<TASK_MODULE>" -- resolver-arg "task-name=<TASK_NAME>"

Where --resolver-arg is a click option instead of an argument.

Will the resolver always have 2 args, task-module and task-name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just throwing out another idea:

--resolver ... --resolver-task-module ... --resolver task-name ...

"--inputs",
"{{.input}}",
"--output-prefix",
"{{.outputPrefix}}",
"--raw-output-data-prefix",
"{{.rawOutputDataPrefix}}",
"--resolver",
self.task_resolver.location,
"--",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i really prefer --resolver-arg type of usage, why do you want this raw command line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spent half an hour getting it to work with click and gave up?

@kumare3
Copy link
Contributor

kumare3 commented Mar 11, 2021

This looks great to me, couple nits

flytekit/core/python_auto_container.py Outdated Show resolved Hide resolved
flytekit/core/python_auto_container.py Outdated Show resolved Hide resolved

task_module = importlib.import_module(task_module)
task_def = getattr(task_module, task_name)
return task_def
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

task_def and task_module vars only used in once, consider return getattr(importlib.import_module(task_module), task_name)

"""
return func.__code__.co_flags & inspect.CO_NESTED != 0

container_args = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can do

return [
    "pyflyte-execute",
    ...
    "--",
    *self.task_resolver.loader_args(settings, self),
]

@@ -244,20 +144,23 @@ def execute(self, **kwargs) -> Any:
return self.dynamic_execute(self._task_function, **kwargs)

def get_command(self, settings: SerializationSettings) -> List[str]:
return [
container_args = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can do

return [
    "pyflyte-execute",
    ...
    "--",
    *self.task_resolver.loader_args(settings, self),
]

flytekit/core/map_task.py Show resolved Hide resolved
)


class TaskResolverMixin(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python 3+: object is default parent class

Suggested change
class TaskResolverMixin(object):
class TaskResolverMixin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like keeping it though, makes it explicit and it's everywhere else.

flytekit/core/context_manager.py Show resolved Hide resolved
"--inputs",
"{{.input}}",
"--output-prefix",
"{{.outputPrefix}}",
"--raw-output-data-prefix",
"{{.rawOutputDataPrefix}}",
"--resolver",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @kumare3 means an alternative syntax is
--resolver ... --resolver-arg "task-module=<TASK_MODULE>" -- resolver-arg "task-name=<TASK_NAME>"

Where --resolver-arg is a click option instead of an argument.

Will the resolver always have 2 args, task-module and task-name?

wild-endeavor and others added 5 commits March 11, 2021 10:17
Signed-off-by: wild-endeavor <[email protected]>
Signed-off-by: wild-endeavor <[email protected]>
@wild-endeavor
Copy link
Contributor Author

Just to answer the above - the resolver arg is not an option yeah, it's a click arg. The number of arguments is unknown except that we require it to be >=1. it could be just one or someone may make a new resolver a year from now that has a hundred args.

@wild-endeavor wild-endeavor merged commit ce42886 into master Mar 11, 2021
@v01dXYZ
Copy link

v01dXYZ commented Mar 14, 2021

Would it be useful to add a warning about cloudpickle ? It's not a robust way for serializing Python classes/functions compared to textual/bytecode code (or a package). Indeed, only the objects in __main__ would be serialized, which means if an user wants to gather helper functions in a utils.py that won't work anymore + the warnings from cloudpickle docs (https://github.com/cloudpipe/cloudpickle)

cloudpickle is excellent if we want to send functions from a REPL to a cluster but not with dealing with a modular source tree.

max-hoffman pushed a commit to dolthub/flytekit that referenced this pull request May 11, 2021
@kumare3
Copy link
Contributor

kumare3 commented Oct 10, 2021

Would it be useful to add a warning about cloudpickle ? It's not a robust way for serializing Python classes/functions compared to textual/bytecode code (or a package). Indeed, only the objects in __main__ would be serialized, which means if an user wants to gather helper functions in a utils.py that won't work anymore + the warnings from cloudpickle docs (https://github.com/cloudpipe/cloudpickle)

cloudpickle is excellent if we want to send functions from a REPL to a cluster but not with dealing with a modular source tree.
@v01dXYZ we did not see this, I do agree. that is why Flytekit does not really recommend or support cloudpickle in any meaningful way. but we would love to talk more, would be open to discuss in slack?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants