Investigate different ways to pass parameters between tasks #41

Closed
Tomcli opened this issue Mar 19, 2020 · 13 comments · Fixed by #75
Comments

@Tomcli
Member

Tomcli commented Mar 19, 2020

Currently, we are using task.results for passing parameter outputs. This has a limitation: the output parameter files have to be under /tekton/results. However, some Kubeflow pipelines have no configuration for the output file path (e.g. the current Watson ML example). Therefore, we need to figure out an alternative that can take output parameter files from any path in the container.
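
For context, a minimal sketch of the /tekton/results constraint on the Tekton side (result and step names here are illustrative): a step can only surface a result by writing to $(results.<name>.path), which Tekton resolves to a file under /tekton/results:

results:
- name: data
steps:
- name: produce-output
  image: busybox
  script: |
    echo "some value" > $(results.data.path)   # resolves to /tekton/results/data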

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
feature 0.93


@Tomcli
Member Author

Tomcli commented Mar 19, 2020

/p1

@ckadner
Member

ckadner commented Mar 19, 2020

FYI, linking to PR #27 which introduced the limited file_outputs parameter passing

PR #27 made this scenario work, where the path of the output file (#1) shows up as a string in the arguments (#2) of the ContainerOp:

from kfp import dsl  # KFP SDK import, needed for dsl.ContainerOp

def gcs_download_op(url):
  return dsl.ContainerOp(
    name='GCS - Download',
    image='google/cloud-sdk:279.0.0',
    command=['sh', '-c'],
    arguments=['gsutil cat $0 | tee $1', url, '/tmp/results.txt'], # 2: output file path
    file_outputs={
      'data': '/tmp/results.txt',  # 1: output parameter mapped to file path
    }
  )

which the KFP-Tekton compiler replaces with the expression $(results.data.path):

spec:
  params:
  - name: url1
  results:
  - name: data              # 1: result parameter definition
    description: /tmp/results.txt
  steps:
  - name: gcs-download
    image: google/cloud-sdk:279.0.0
    command: ['sh', '-c']
    args:
    - gsutil cat $0 | tee $1
    - $(inputs.params.url1)
    - $(results.data.path)  # 2: replaced file path with result parameter expression

However, if the output file location is dynamically provided as an input parameter, or is simply known to the developer and does not show up in clear text, then the above logic fails:

def TrainOp(name, input_dir, output_dir, model_name, model_version, epochs):
  return dsl.ContainerOp(
    name=name,
    image='<train-image>',
    arguments=[      # 2: there is no output file path to replace with a Tekton variable expression
      '--input_dir', input_dir,
      '--output_dir', output_dir,
      '--model_name', model_name,
      '--model_version', model_version,
      '--epochs', epochs
    ],
    file_outputs={'output': '/output.txt'}  # 1: output parameter mapped to file path
  )

@Tomcli
Member Author

Tomcli commented Mar 20, 2020

Initial proposal: add an extra step that copies the output files to /tekton/results.
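
A rough sketch of what that extra step could look like in the generated Task (step names and paths are illustrative; this assumes the output file is on a path visible to both steps, which is the catch discussed below):

steps:
- name: main                    # user's original step, writes /output.txt
  image: <train-image>
- name: copy-results            # extra step appended by the compiler
  image: busybox
  script: |
    cp /output.txt $(results.output.path)   # lands under /tekton/results/output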

@Tomcli
Member Author

Tomcli commented Mar 24, 2020

Update: passing files between steps will still require either a volumeMount or workspaces.
However, both still require specifying a mount path that is not /, so this won't scale if users store their files in more than one subpath.

@Tomcli
Member Author

Tomcli commented Mar 24, 2020

Right now we don't have code-gen for the PipelineRun, so creating workspaces requires users to manually define their workspace volumes. Should we propose also generating the PipelineRun as part of our YAML?
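
For illustration, a generated PipelineRun could bind a concrete volume to the workspace at run time (the resource names and claim name below are placeholders):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: my-pipeline-run
spec:
  pipelineRef:
    name: my-pipeline
  workspaces:
  - name: shared-data            # workspace declared by the Pipeline
    persistentVolumeClaim:
      claimName: my-pvc          # user's PVC, bound only at run time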

@ckadner
Member

ckadner commented Mar 24, 2020

@afrittoli -- can you clarify whether we really need a Workspace (and a PipelineRun to mount the volume) here? We are just trying to add an additional step to those tasks whose output parameters need to be moved to the default /tekton/results folder.

@ckadner
Member

ckadner commented Mar 27, 2020

/kind discussion
/remove-kind feature

@Tomcli
Member Author

Tomcli commented Mar 27, 2020

Added an initial approach that copies the files with an extra step. @afrittoli, let me know if you can think of a better way to pass parameters from any path.

@afrittoli
Contributor

> Update: passing files between steps will still require either a volumeMount or workspaces.
> However, both still require specifying a mount path that is not /, so this won't scale if users store their files in more than one subpath.

For sharing between steps you can mount an emptyDir; however, that would not work for sharing across tasks. You could use an emptyDir mounted at /kfp and transform user-provided paths to be relative to it, i.e. /tmp/output.txt would become /kfp/tmp/output.txt in the Tekton YAML.
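
A sketch of that idea at the Task level (volume and step names are illustrative): an emptyDir is mounted at /kfp in every step and the compiler rewrites the user's paths accordingly:

spec:
  results:
  - name: output
  volumes:
  - name: kfp-scratch            # emptyDir shared by all steps in the Task's pod
    emptyDir: {}
  steps:
  - name: main
    image: <train-image>
    volumeMounts:
    - {name: kfp-scratch, mountPath: /kfp}
    args: ['--output_file', '/kfp/tmp/output.txt']   # rewritten from /tmp/output.txt
  - name: copy-results
    image: busybox
    volumeMounts:
    - {name: kfp-scratch, mountPath: /kfp}
    script: |
      cp /kfp/tmp/output.txt $(results.output.path)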

@afrittoli
Contributor

> @afrittoli -- can you clarify whether we really need a Workspace (and a PipelineRun to mount the volume) here? We are just trying to add an additional step to those tasks whose output parameters need to be moved to the default /tekton/results folder.

As long as you're within the boundaries of a Task, you don't need a Workspace.
Workspaces are a useful abstraction that allows you to avoid embedding the name of a PVC into the static definition of a Task/Pipeline, so that you can bind any PVC to the workspace at runtime. You don't need workspaces for file sharing between steps; you can use the node filesystem for that and mount an emptyDir to any path you like (except /tekton, which is reserved).

@afrittoli
Contributor

As a side note, I feel that allowing absolute paths in the KFP DSL is a leak of a lower-level detail into the DSL abstraction, but I guess it might be too late to change that?
As a data scientist / machine learning engineer, I would like to create pipeline operations with named outputs; how these outputs are stored in files within the containers is not something I should worry about. The only file interface should be when I want to export data to permanent storage so that it's available outside of the pipeline.
