Kubeflow-pipeline-postsubmit-integration-test failure #5007

Closed
hilcj opened this issue Jan 19, 2021 · 13 comments · Fixed by #5039 or #5045

Contributor

hilcj commented Jan 19, 2021

KFP oncall noticed that kubeflow-pipeline-postsubmit-integration-test is failing.

/kind bug

Contributor Author

hilcj commented Jan 19, 2021

@Bobgy

Contributor

Bobgy commented Jan 20, 2021

@hilcj do you need help from other members?
It's oncall's responsibility to do the initial investigations.

But feel free to delegate to us if you think it's outside your area of expertise.

Contributor Author

hilcj commented Jan 20, 2021

@Bobgy actually I just wanted to ask you whether this is a known issue, because the failure started at least two weeks ago and previous oncalls may have already reported it.

If not, I'll do the investigation and get back to you.

Btw, do you know where we keep track of live issues? It seems the oncall handover notes have not been updated since Dec 18, and there has been no update to the "kfp oncalls book - live issues" since my last oncall in Nov.

Contributor

Bobgy commented Jan 20, 2021

It should be the handover notes, but I guess @Ark-kun and @IronPan didn't take them.

Did you see this problem before?

@chensun chensun self-assigned this Jan 26, 2021
Member

chensun commented Jan 26, 2021

The error is from the dataflow sample test and is related to a recent fix I made for the dataflow component. Will send a fix shortly.

Member

chensun commented Jan 26, 2021

Postsubmit is still red with multiple errors. Reopening this, and I'll investigate them one by one shortly.

Member

chensun commented Jan 27, 2021

There are other build failures similar to the one I fixed above. Reopening, and I'll make fixes shortly.

Contributor

Bobgy commented Jan 28, 2021

Awesome, thank you @chensun!

Contributor

Ark-kun commented Jan 28, 2021

JFYI:
The latest issue with the deprecated dataflow component container build was caused by pip 21.0 dropping support for Python 2 (pypa/pip#6148). Those container images were dynamically installing the latest version of pip, which caused the build to start failing.
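
To illustrate the failure mode described above, here is a minimal Python sketch that chooses a pip requirement string based on the interpreter version, so that a Python 2 environment never pulls pip 21.0 or later. The function name `pip_requirement` is hypothetical and not part of pip or KFP, and the `pip<21` cap is an assumption based only on the statement that pip 21.0 dropped Python 2 support.

```python
import sys


def pip_requirement(version_info=None):
    """Return a pip requirement string suitable for the given interpreter version."""
    if version_info is None:
        version_info = sys.version_info
    # pip 21.0 dropped Python 2 support, so cap pip below 21 on Python 2.
    if version_info[0] < 3:
        return "pip<21"
    # Assumption: on Python 3 the latest pip is acceptable.
    return "pip"


if __name__ == "__main__":
    # Print the requirement that an install step could pass to its installer.
    print(pip_requirement())
```

An image that must stay on Python 2 could pass the resulting string to its install step, for example `pip install --upgrade "pip<21"`, instead of always taking the latest release. Moving the image to Python 3 avoids the need for the cap altogether.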

Member

chensun commented Jan 29, 2021

https://oss-prow.knative.dev/view/gs/oss-prow/logs/kubeflow-pipeline-postsubmit-standalone-component-test/1354926169316659200

Latest test error was:

Adding pip 21.0 to easy-install.pth file
Installing pip script to /usr/local/bin
Installing pip2.7 script to /usr/local/bin
Installing pip2 script to /usr/local/bin

Installed /usr/local/lib/python2.7/dist-packages/pip-21.0-py2.7.egg
Processing dependencies for pip
Finished processing dependencies for pip
Traceback (most recent call last):
  File "/usr/local/bin/pip", line 11, in <module>
    load_entry_point('pip==21.0', 'console_scripts', 'pip')()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 561, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2631, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2291, in load
    return self.resolve()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2297, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python2.7/dist-packages/pip-21.0-py2.7.egg/pip/_internal/cli/main.py", line 60
    sys.stderr.write(f"ERROR: {exc}")

and this is due to the content of gs://ml-pipeline/sample-pipeline/xgboost/initialization_actions.sh

#!/bin/bash -e

# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Initialization actions to run in dataproc setup.
# The script will be run on each node in a dataproc cluster.

easy_install pip
pip install tensorflow==1.4.1
pip install pandas==0.18.1

I'm going to update its content to use Python 3.

Member

chensun commented Jan 30, 2021

After #5062 and updating gs://ml-pipeline/sample-pipeline/xgboost/initialization_actions.sh, the previous error is fixed.

Now we get a runtime error when submitting the Dataproc Spark job:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ml/util/MLWritable$class
	at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.<init>(XGBoostEstimator.scala:38)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.<init>(XGBoostEstimator.scala:42)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:182)
	at ml.dmlc.xgboost4j.scala.example.spark.XGBoostTrainer$.main(XGBoostTrainer.scala:120)
	at ml.dmlc.xgboost4j.scala.example.spark.XGBoostTrainer.main(XGBoostTrainer.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.ml.util.MLWritable$class
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 17 more
21/01/30 03:22:56 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@72ab05ed{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
Job output is complete

My guess is that we need to update this package [1] to accommodate the newer version of Spark that comes with the Dataproc 1.5 image.

[1] gs://ml-pipeline/sample-pipeline/xgboost/xgboost4j-example-0.8-SNAPSHOT-jar-with-dependencies.jar
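
A missing `...$class` class is, as far as I know, typical of code compiled against Scala 2.11 running on a Scala 2.12 runtime: 2.11 compiles traits with concrete methods into separate `Trait$class` implementation classes, and 2.12 no longer generates them. Whether that is the cause here is an assumption and is not confirmed in this thread. As a rough heuristic, the Python sketch below lists entries in a local copy of a jar whose names end in `$class.class`. The function name and the local file path are hypothetical.

```python
import zipfile


def scala_211_trait_classes(jar_path, limit=5):
    """Return up to `limit` jar entries that look like Scala 2.11 trait implementation classes."""
    with zipfile.ZipFile(jar_path) as jar:
        # A jar is a zip archive; trait implementation classes end in "$class.class".
        hits = [name for name in jar.namelist() if name.endswith("$class.class")]
    return hits[:limit]


if __name__ == "__main__":
    # Hypothetical local path; the jar would first need to be copied from GCS.
    for entry in scala_211_trait_classes("xgboost4j-example.jar"):
        print(entry)
```

A non-empty result would suggest the jar bundles code built for Scala 2.11 or earlier, in which case rebuilding it against the Scala and Spark versions on the cluster image is one option to try.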

Member

chensun commented Feb 4, 2021

Opened #5089 to track the XGBoost issue. Handing over the rest to @hongye-sun.

@chensun chensun assigned hongye-sun and unassigned chensun Feb 4, 2021
Contributor

Bobgy commented Feb 5, 2021

Postsubmit is now healthy

@Bobgy Bobgy closed this as completed Feb 5, 2021