unable to build kustomize for mnist example #681

Closed
ryandawsonuk opened this issue Nov 21, 2019 · 21 comments

@ryandawsonuk

I'm trying to follow the mnist example using the local storage steps, but when I run kustomize build . I get:

no matches for OriginalId kubeflow.org_v1beta2_TFJob|~X|$(trainingName); no matches for CurrentId kubeflow.org_v1beta2_TFJob|~X|$(trainingName); failed to find unique target for patch kubeflow.org_v1beta2_TFJob|$(trainingName

I've tried with kustomize v3 (go get -u sigs.k8s.io/kustomize/kustomize/v3) and v2 (go get -u sigs.k8s.io/kustomize/kustomize/v2) but I get the same error with both. I am running from the training/local directory (have also tried the GCS one and get the same error).

I'm not able to get as far as #672 as I can't get the kustomize build . step to complete.

@issue-label-bot

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.61. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

@fenglixa
Member

kustomize 2.0.3 is required, as mentioned in the mnist example document.

Other versions of kustomize seem to have issues with it. I remember such issues being logged and closed before.
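To confirm which kustomize is actually on the PATH before building, a minimal sketch along these lines may help. It assumes the 2.x output format of `kustomize version` that a later comment in this thread shows (`Version: {KustomizeVersion:2.0.3 ...}`); other releases may print a different format, in which case the check simply fails. The function names are hypothetical helpers, not part of kustomize.

```shell
#!/usr/bin/env bash
# Extract the KustomizeVersion field from `kustomize version` output.
# Assumption: the 2.x format "Version: {KustomizeVersion:2.0.3 GitCommit:...}".
kustomize_version_of() {
  printf '%s\n' "$1" | sed -n 's/.*KustomizeVersion:\([^ ]*\).*/\1/p'
}

# Hypothetical helper: succeeds only if the given output reports 2.0.3.
is_expected_kustomize() {
  [ "$(kustomize_version_of "$1")" = "2.0.3" ]
}
```

Possible usage: `is_expected_kustomize "$(kustomize version)" || echo "unexpected kustomize version" >&2`. Note that a matching version is not confirmed in this thread to be sufficient for the build to succeed.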

@ryandawsonuk
Author

ryandawsonuk commented Nov 22, 2019

Oh I hadn't noticed that. But this time I tried with go get sigs.k8s.io/kustomize/kustomize/[email protected] and I get the same error :( I removed the existing kustomize version first and did a which kustomize to check. Actually it's a different error from the one referenced in the doc

@ryandawsonuk
Author

ryandawsonuk commented Nov 22, 2019

I've tried to deduce what the kustomize output should be, but I keep getting a path or format wrong. Would you be able to share an example?

@fenglixa
Member

Here is the output (a successful example) on my side after running "kustomize build .":

apiVersion: v1
data:
  batchSize: "100"
  exportDir: /mnt/export
  learningRate: "0.02"
  modelDir: /mnt
  name: tfjob-021
  pvcMountPath: /mnt
  pvcName: fengpvc
  trainSteps: "200"
kind: ConfigMap
metadata:
  name: mnist-map-training-4t25c985bg
---
apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: tfjob-021
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.02"
            image: docker.io/fenglixa/mytfmodel:tag
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: fengpvc
    Ps:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.02"
            image: docker.io/fenglixa/mytfmodel:tag
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: fengpvc
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.02"
            image: docker.io/fenglixa/mytfmodel:tag
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: fengpvc

@fenglixa
Member

This should be the same issue as #609.

@ryandawsonuk
Author

After editing that yaml I was able to run the TFJob.

@plaffitte

@ryandawsonuk How exactly did you solve this? I can't get past the issue with the TFJob version...

@ryandawsonuk
Author

@plaffitte I haven't got kustomize working yet; I just modified the yaml that @fenglixa provided above to use the v1 format - SeldonIO/seldon-core#1106 (comment)
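As a rough illustration of that workaround, the sketch below rewrites the TFJob apiVersion line in rendered YAML from v1beta2 to v1. It assumes that only the apiVersion line needs to change, which this thread does not confirm; `to_tfjob_v1` and the file names in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Rewrite the TFJob apiVersion from v1beta2 to v1 in YAML read from stdin.
# Assumption: no other field differs between the two versions for this job.
# Lines such as the ConfigMap's "apiVersion: v1" are left untouched.
to_tfjob_v1() {
  sed 's|^apiVersion: kubeflow\.org/v1beta2$|apiVersion: kubeflow.org/v1|'
}
```

Possible usage: `to_tfjob_v1 < rendered.yaml > tfjob-v1.yaml`, followed by reviewing the result before applying it.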

@plaffitte

plaffitte commented Nov 28, 2019

I get the following error:

unable to recognize "config.yaml": no matches for kind "TFJob" in version "kubeflow.org/v1beta2"

My file looks like this:

data:
  batchSize: "100"
  exportDir: /mnt/export
  learningRate: "0.01"
  modelDir: /mnt
  name: mnist-train-local
  pvcMountPath: /mnt
  pvcName: mnist-test
  trainSteps: "200"
kind: ConfigMap
metadata:
  name: mnist-map-training-kcc7dkhf4b
---
apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: mnist-train-local
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.01"
            image: docker.io/pierremoodagent/mytfmodel:test
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: mnist-test
    Ps:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.01"
            image: docker.io/pierremoodagent/mytfmodel:test
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: mnist-test
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.01"
            image: docker.io/pierremoodagent/mytfmodel:test
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: mnist-test

@ryandawsonuk
Author

Yeah, the TFJob version in kubeflow is now:

apiVersion: kubeflow.org/v1
kind: TFJob

@plaffitte

Oops, sorry. I actually tried both and failed but copy-pasted the wrong one...
Here's the error I get:

unable to recognize "config.yaml": no matches for kind "TFJob" in version "kubeflow.org/v1"

And my file:

data:
  batchSize: "100"
  exportDir: /mnt/export
  learningRate: "0.01"
  modelDir: /mnt
  name: mnist-train-local
  pvcMountPath: /mnt
  pvcName: mnist-test
  trainSteps: "200"
kind: ConfigMap
metadata:
  name: mnist-map-training-kcc7dkhf4b
---
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train-local
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.01"
            image: docker.io/pierremoodagent/mytfmodel:test
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: mnist-test
    Ps:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.01"
            image: docker.io/pierremoodagent/mytfmodel:test
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: mnist-test
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.01"
            image: docker.io/pierremoodagent/mytfmodel:test
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: mnist-test

@ryandawsonuk
Author

What does kubectl get crd tfjobs.kubeflow.org -o yaml return for you?

@plaffitte

It returns Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "tfjobs.kubeflow.org" not found

@ryandawsonuk
Author

Then you need to install the CRD. Did you do a kfctl install of kubeflow?
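The CRD check discussed above can be wrapped in a small script. This is a sketch only: `tfjob_crd_versions` and `check_tfjob_crd` are hypothetical helper names, and the jsonpath expression assumes the CRD lists its versions under `spec.versions`, which may not hold for older CRD definitions.

```shell
#!/usr/bin/env bash
# Print the API versions the TFJob CRD defines; fails if the CRD is absent.
# Assumption: versions are listed under spec.versions in the CRD.
tfjob_crd_versions() {
  kubectl get crd tfjobs.kubeflow.org -o jsonpath='{.spec.versions[*].name}'
}

# Report whether the CRD is installed and, if so, which versions it lists.
check_tfjob_crd() {
  local versions
  if ! versions="$(tfjob_crd_versions 2>/dev/null)"; then
    echo "TFJob CRD not installed"
    return 1
  fi
  echo "TFJob CRD versions: ${versions}"
}
```

If `check_tfjob_crd` reports that the CRD is missing, no apiVersion in the TFJob manifest will be recognized, which matches the "no matches for kind" errors above.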

@janeman98
Contributor

I still get an error when using v2.0.3, although it is different from the one I get with v3.4.0:

kustomize version
Version: {KustomizeVersion:2.0.3 GitCommit:a6f65144121d1955266b0cd836ce954c04122dc8 BuildDate:2019-03-05T20:37:42Z GoOs:linux GoArch:amd64}

kustomize build . |kubectl apply -f -
Error: couldn't find target kubeflow.org_v1beta2_TFJob|~X|~P|$(trainingName)|~S for json patch
error: no objects passed to apply
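The second line of that output appears because the pipe runs kubectl apply even though kustomize build failed and produced no objects. A minimal sketch of one way to avoid the misleading follow-on error is to render to a file first and apply only on success; `build_then_apply` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Render with kustomize first; run kubectl apply only if the build succeeded.
build_then_apply() {
  local rendered status
  rendered="$(mktemp)" || return 1
  if kustomize build "${1:-.}" > "$rendered"; then
    kubectl apply -f "$rendered"
    status=$?
  else
    echo "kustomize build failed; nothing applied" >&2
    status=1
  fi
  rm -f "$rendered"   # clean up the temporary manifest in both cases
  return "$status"
}
```

Possible usage from the training/local directory: `build_then_apply .`. This does not fix the kustomize error itself; it only keeps the failure from being followed by a second, unrelated-looking one.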

@jtfogarty

jtfogarty commented Jan 8, 2020

/area example

@k8s-ci-robot
Contributor

@jtfogarty: The label(s) area/kustomize cannot be applied, because the repository doesn't have them

In response to this:

/area kustomize

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jlewi
Contributor

jlewi commented Feb 11, 2020

The version should probably be v1.

@stale

stale bot commented May 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/tfjob 0.71

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

@stale stale bot closed this as completed May 18, 2020