build_image_dataset.py crashes for delf training #10474

avidullu · 2022-01-27T22:19:03Z

Following the instructions at https://github.com/tensorflow/models/blob/master/research/delf/delf/python/training/README.md#prepare-the-data-for-training on a Google Cloud VM without any GPUs fails with an error.

Below is the command, followed by the error output:
python3 build_image_dataset.py --train_csv_path=$LANDMARK_DATA/train/train.csv --train_clean_csv_path=$LANDMARK_DATA/train/train_clean.csv --train_directory=$LANDMARK_DATA/train//// --output_directory=$LANDMARK_DATA/tfrecord/ --num_shards=128 --generate_train_validation_splits --validation_split_size=0.2 --test_csv_path=$LANDMARK_DATA/train/test.csv --test_directory=$LANDMARK_DATA/test////
2022-01-27 22:07:03.277440: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-01-27 22:07:03.277495: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-01-27 22:07:05.252430: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-01-27 22:07:05.252502: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-01-27 22:07:05.252525: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (gcsfuse-experiment): /proc/driver/nvidia/version does not exist
/home/avidullu/mldata/cvdfoundation/google-landmark/train/train_clean.csv
Traceback (most recent call last):
  File "build_image_dataset.py", line 491, in <module>
    app.run(main)
  File "/home/avidullu/.local/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/avidullu/.local/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "build_image_dataset.py", line 485, in main
    FLAGS.seed)
  File "build_image_dataset.py", line 439, in _build_train_tfrecord_dataset
    image_dir)
  File "build_image_dataset.py", line 144, in _get_clean_train_image_files_and_labels
    df = pd.read_csv(csv_file)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
    self._open_handles(src, kwds)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/base_parser.py", line 229, in _open_handles
    errors=kwds.get("encoding_errors", "strict"),
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/common.py", line 724, in get_handle
    newline="",
AttributeError: 'GFile' object has no attribute 'readable'

Line 143 of models/research/delf/delf/python/training/build_image_dataset.py (at commit a033df7):

with tf.io.gfile.GFile(csv_path, 'rb') as csv_file:

seems to open the CSV in binary mode for reading. After removing the 'b' here and at line 116 of the same file, the script makes progress.
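
The traceback ends inside pandas' get_handle at a call that passes newline="", which suggests the binary-mode handle was being wrapped in io.TextIOWrapper. Assuming that reading of the traceback is right, the failure can be sketched with the standard library alone. MinimalBinaryReader is a hypothetical stand-in for a file-like object that, like the GFile in the error above, has no readable() method; it is not TensorFlow's class, and try_wrap is likewise an illustrative helper.

```python
import io


class MinimalBinaryReader:
    """Hypothetical file-like object that offers read() but no readable()."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def read(self, n=-1):
        return self._buf.read(n)


def try_wrap(handle):
    """Try to wrap a binary handle for text decoding.

    Returns the AttributeError message if wrapping fails, else None.
    """
    try:
        io.TextIOWrapper(handle, encoding="utf-8", newline="")
        return None
    except AttributeError as err:
        return str(err)


# A handle without readable() is rejected by TextIOWrapper (CPython).
print(try_wrap(MinimalBinaryReader(b"id,url\n")))

# A full binary stream such as BytesIO wraps without error.
print(try_wrap(io.BytesIO(b"id,url\n")))
```

A handle that already yields text presumably never goes through this wrapping step, which would be consistent with the script making progress once the 'b' is dropped.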

tensorflowbutler · 2022-01-29T04:12:35Z

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

avidullu · 2022-01-30T07:13:46Z

Responding inline

Q. What is the top-level directory of the model you are using
A. models/research/delf

Q. Have I written custom code?
A. No. Ran the commands as mentioned in research/delf/delf/python/training/README.md

Q. OS Platform and Distribution
A. GCP VM with Debian image. 32 CPU and 128GB RAM

Q. TensorFlow installed from?
A. From the script mentioned in /research/delf/INSTALL_INSTRUCTIONS.md

Q. TensorFlow version
A. 2.7

Q. Bazel version
A. 5.0

Q. CUDA/cuDNN version
A. NA (no GPU on hardware)

Q. GPU model
A. NA (no GPU on hardware)

Q. Exact command to reproduce
A. python3 build_image_dataset.py --train_csv_path=$LANDMARK_DATA/train/train.csv --train_clean_csv_path=$LANDMARK_DATA/train/train_clean.csv --train_directory=$LANDMARK_DATA/train/ --output_directory=$LANDMARK_DATA/tfrecord/ --num_shards=128 --generate_train_validation_splits --validation_split_size=0.2 --test_csv_path=$LANDMARK_DATA/train/test.csv --test_directory=$LANDMARK_DATA/test/

andrefaraujo · 2022-02-03T16:18:07Z

Thanks for reporting this!

@dan-anghel , I think you wrote this part of the code. Do you remember if the 'b' is really necessary when reading the CSV? What @avidullu reported makes sense to me, although I remember you had run the code several times, which seems to contradict it.

khatchad · 2024-08-07T14:28:02Z

@andrefaraujo wrote:

> Thanks for reporting this!
>
> @dan-anghel , I think you wrote this part of the code. Do you remember if the 'b' is really necessary when reading the CSV?

The CSV shouldn't be opened in binary mode. Removing the 'b' solves the problem for me.

CSVs should be text files.
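
A minimal sketch of the text-mode approach, written against the standard library so it runs without TensorFlow or pandas. The names load_rows, opener and demo are hypothetical, as are the column names and sample row in the demo; in the DELF script the opener would presumably be tf.io.gfile.GFile and the parser pd.read_csv, neither of which is exercised here.

```python
import csv
import os
import tempfile


def load_rows(csv_path, opener=open):
    """Open csv_path in text mode ('r', not 'rb') and parse it as CSV."""
    with opener(csv_path, "r") as csv_file:
        return list(csv.DictReader(csv_file))


def demo():
    """Write a tiny CSV to a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train_clean.csv")
        with open(path, "w", newline="") as f:
            # Illustrative header and row, not real dataset contents.
            f.write("landmark_id,images\n1,abc def\n")
        return load_rows(path)


print(demo())
```

Passing the opener as a parameter keeps the mode choice in one place, so the same function can be tried with a different file API without touching the parsing code.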

tensorflowbutler added the stat:awaiting response Waiting on input from the contributor label Jan 29, 2022

kumariko removed the stat:awaiting response Waiting on input from the contributor label Jan 31, 2022

kumariko self-assigned this Jan 31, 2022

kumariko added models:research models that come under research directory type:bug Bug in the code labels Feb 1, 2022

kumariko assigned andrefaraujo and unassigned kumariko Feb 2, 2022

andrefaraujo assigned andrefaraujo and unassigned andrefaraujo Feb 3, 2022

khatchad added a commit to ponder-lab/models that referenced this issue Aug 7, 2024

Workaround tensorflow#10474.

9052f44

khatchad added a commit to ponder-lab/models that referenced this issue Aug 7, 2024

Fix tensorflow#10474.

03aa7c9

CSVs should be text files.

khatchad linked a pull request Aug 7, 2024 that will close this issue

DELF: Fix CSV read error #11249

Open
