[BUG] Cannot convert data of type str from v3 to v4 using the convert function #2981

CorneliaMelon opened this issue Nov 4, 2024 · 9 comments
@CorneliaMelon

Severity

P0 - Critical breaking issue or missing functionality

Current Behavior

I am trying to convert my datasets from v3 to v4. They contain columns of several different data types. My 'instruction' column holds strings of varying lengths, and the conversion fails with this error:

File "/opt/miniconda3/lib/python3.11/site-packages/deeplake/__init__.py", line 164, in convert
dest_ds.append(b)

deeplake._deeplake.InvalidColumnValueError: Invalid value for column 'instruction'. Reason - 'Data must have 2 dimensions provided 1'

Interestingly, when I explicitly append the instruction column in my own script, it works.

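import deeplake

def convert_raw(target):
    # Create the v4 destination and declare the text column explicitly.
    dest_ds = deeplake.create(target)
    dest_ds.add_column("instruction", str)
    # Read only the instruction column from the source dataset.
    source_ds = deeplake.query(
        'select instruction from "/Users/..."')
    # Append the query result and persist it.
    dest_ds.append(source_ds)
    dest_ds.commit()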

If I use the convert function only on the instruction column, I also get an error:

def convert_raw(target):
    dest_ds = deeplake.create(target)
    dest_ds.add_column("instruction", str)
    source_ds = deeplake.query(
        'select instruction from "/Users/..."')
    print("Source size: ", len(source_ds))
    convert(source_ds, dest_ds)

Terminal output for source_ds (breakpoint at the print statement), followed by the error message:
source_ds
PyDev console: starting.
Dataset(columns=(instruction), length=2013)
Source size: 2013
convert(source_ds, dest_ds)
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "", line 1, in
File "/Users/...", line 9, in convert
deeplake.convert(source, target)
File "/opt/miniconda3/lib/python3.11/site-packages/deeplake/__init__.py", line 156, in convert
source_ds = deeplake.query(f'select * from "{src}"')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The query source - 'Dataset(columns=(instruction), length=2013)', is not found or not supported.
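
A note on the RuntimeError above: line 156 in the traceback shows the query text being built with an f-string, `f'select * from "{src}"'`, so whatever is passed as the source gets formatted into the query. The sketch below illustrates how passing an object instead of a path string would produce the source name seen in the error message. `FakeDataset`, `build_query`, and the path are hypothetical stand-ins for illustration, not deeplake APIs.

```python
class FakeDataset:
    """Hypothetical stand-in for an opened dataset or view; not a deeplake class."""

    def __init__(self, columns, length):
        self.columns = columns
        self.length = length

    def __repr__(self):
        return f"Dataset(columns=({','.join(self.columns)}), length={self.length})"


def build_query(src):
    # Mirrors the f-string from the traceback: `src` is formatted into the
    # query text, so a non-string object contributes its repr.
    return f'select * from "{src}"'


path_query = build_query("/data/my_v3_dataset")  # made-up path
view_query = build_query(FakeDataset(["instruction"], 2013))

print(path_query)
print(view_query)
```

If this reading is right, the convert function expects path or URL strings, like the `src=` and `dst=` values in the call quoted later in this report, rather than an already opened dataset.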

If I query the whole dataset and then select only the instruction column, I also get an error:

def convert_raw(target):
    dest_ds = deeplake.create(target)
    dest_ds.add_column("instruction", str)
    source_ds = deeplake.query(
        'select * from "/Users/..."')
    source_ds = source_ds['instruction']
    convert(source_ds, dest_ds)

Error message:
Process finished with exit code 139 (interrupted by signal 11:SIGSEGV)


Is there any way I can still use the automatic deeplake.convert(src='al://org_name/existing_v3_dataset', dst='al://org_name/new_v4_dataset')?

Steps to Reproduce

To reproduce the errors, see code in the problem description above.

Expected/Desired Behavior

Dataset being converted from v3 to v4.

Python Version

Python 3.11.8

OS

No response

IDE

No response

Packages

No response

Additional Context

No response

Possible Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR (Thank you!)
CorneliaMelon added the bug label on Nov 4, 2024
@davidbuniat
Member

Hey @CorneliaMelon, sorry you ran into this problem, and thanks for sharing the error. Could you please share the output of source_ds.summary() so we can better understand the structure of the original dataset, its columns, and the types used?

@CorneliaMelon
Author

Hey @davidbuniat thanks for the quick response!

source_ds.summary()
Dataset(columns=(obs/joint_positions,obs/joint_velocities,obs/front_camera,obs/front_right_camera,obs/front_left_camera,obs/prev_action,obs/qpos,obs/qvel,action,timings/iter_start_time,timings/data_append_time,timings/env_step_time,reset_before_step,instruction,teleop,in_correction), length=2111)
+------------------------+------------------------------------------+
|         column         |                   type                   |
+------------------------+------------------------------------------+
|  obs/joint_positions   |    array(dtype=float64, shape=[None])    |
+------------------------+------------------------------------------+
|  obs/joint_velocities  |    array(dtype=float64, shape=[None])    |
+------------------------+------------------------------------------+
|    obs/front_camera    |array(dtype=uint8, shape=[None,None,None])|
+------------------------+------------------------------------------+
| obs/front_right_camera |array(dtype=uint8, shape=[None,None,None])|
+------------------------+------------------------------------------+
| obs/front_left_camera  |array(dtype=uint8, shape=[None,None,None])|
+------------------------+------------------------------------------+
|    obs/prev_action     |    array(dtype=float64, shape=[None])    |
+------------------------+------------------------------------------+
|        obs/qpos        |    array(dtype=float64, shape=[None])    |
+------------------------+------------------------------------------+
|        obs/qvel        |    array(dtype=float64, shape=[None])    |
+------------------------+------------------------------------------+
|         action         |    array(dtype=float32, shape=[None])    |
+------------------------+------------------------------------------+
|timings/iter_start_time |    array(dtype=float64, shape=[None])    |
+------------------------+------------------------------------------+
|timings/data_append_time|    array(dtype=float64, shape=[None])    |
+------------------------+------------------------------------------+
| timings/env_step_time  |    array(dtype=float64, shape=[None])    |
+------------------------+------------------------------------------+
|   reset_before_step    |     array(dtype=int64, shape=[None])     |
+------------------------+------------------------------------------+
|      instruction       |     array(dtype=text, shape=[None])      |
+------------------------+------------------------------------------+
|         teleop         |     array(dtype=int64, shape=[None])     |
+------------------------+------------------------------------------+
|     in_correction      |     array(dtype=int64, shape=[None])     |
+------------------------+------------------------------------------+

@khustup2
Contributor

khustup2 commented Nov 5, 2024

Hey @CorneliaMelon, thanks for reporting this. We just released deeplake==4.0.1, which fixes the issue with text column conversion from v3 to v4. Could you please try it and let me know whether your issue with deeplake.convert is fixed? Please ping here if you still face any issues. Thanks!

@CorneliaMelon
Author

@khustup2 The instruction column works now, but I get a new error:
deeplake.convert(source, target)
File "/opt/miniconda3/lib/python3.11/site-packages/deeplake/__init__.py", line 169, in convert
dest_ds.append(b)
deeplake._deeplake.InvalidColumnValueError: Invalid value for column 'in_correction'. Reason - 'Data must have 1 dimensions provided 2'

@khustup2
Contributor

khustup2 commented Nov 6, 2024

@CorneliaMelon any chance you could provide the dataset summary from v3? To do that, please install deeplake==3.9.27, open the source dataset, and send the output of ds.summary(). Thanks for your patience and cooperation!

@CorneliaMelon
Author

@khustup2 here it is:

ds.summary()
Dataset(path='/Users...', tensors=['action', 'in_correction', 'instruction', 'obs/front_camera', 'obs/front_left_camera', 'obs/front_right_camera', 'obs/joint_positions', 'obs/joint_velocities', 'obs/prev_action', 'obs/qpos', 'obs/qvel', 'reset_before_step', 'teleop', 'timings/data_append_time', 'timings/env_step_time', 'timings/iter_start_time'])

tensor                    htype    shape                dtype    compression
------                    -----    -----                -----    -----------
action                    generic  (2111, 6)            float32  None
in_correction             generic  (2111, 1)            int64    None
instruction               text     (2111, 1)            str      None
obs/front_camera          image    (2111, 512, 512, 3)  uint8    jpeg
obs/front_left_camera     image    (2111, 512, 512, 3)  uint8    jpeg
obs/front_right_camera    image    (2111, 512, 512, 3)  uint8    jpeg
obs/joint_positions       generic  (2111, 6)            float64  None
obs/joint_velocities      generic  (2111, 6)            float64  None
obs/prev_action           generic  (2111, 7)            float64  None
obs/qpos                  generic  (2111, 15)           float64  None
obs/qvel                  generic  (2111, 14)           float64  None
reset_before_step         generic  (2111, 1)            int64    None
teleop                    generic  (2111, 1)            int64    None
timings/data_append_time  generic  (2111, 1)            float64  None
timings/env_step_time     generic  (2111, 1)            float64  None
timings/iter_start_time   generic  (2111, 1)            float64  None
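
Comparing the two summaries suggests one possible reading of the 'Data must have 1 dimensions provided 2' error: in v3, `in_correction` has shape (2111, 1), so each batch carries a trailing axis of length 1, while the v4 column appears to expect one scalar per row. The following is only an illustration in plain Python with made-up values. `ndim` and `squeeze_singleton_rows` are hypothetical helpers, not deeplake functions, and this is not necessarily how the maintainers' fix works.

```python
def ndim(value):
    """Nesting depth of a rectangular nested list; a bare scalar has depth 0."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0] if value else None
    return depth


def squeeze_singleton_rows(batch):
    """Turn [[0], [1], [1]] into [0, 1, 1]; leave any other batch untouched."""
    if all(isinstance(row, list) and len(row) == 1 for row in batch):
        return [row[0] for row in batch]
    return batch


v3_style_batch = [[0], [1], [1]]           # made-up values, shaped like (N, 1)
flat_batch = squeeze_singleton_rows(v3_style_batch)
vector_batch = [[0.1, 0.2], [0.3, 0.4]]    # shaped like (N, 2), must stay as is

print(ndim(v3_style_batch), "->", ndim(flat_batch))
print(squeeze_singleton_rows(vector_batch) == vector_batch)
```

Note that a rule like this cannot tell a scalar column from a genuine length-1 vector column, so a real converter would need the column's declared type to decide.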

@khustup2
Contributor

khustup2 commented Nov 7, 2024

@CorneliaMelon thanks for the info! We have this fixed locally and will include it in the upcoming 4.0.2 release. I will let you know once the release is done.

@CorneliaMelon
Author

@khustup2 Awesome, thanks for the update! Looking forward to it.

@khustup2
Contributor

Hey @CorneliaMelon, we released deeplake==4.0.2, which addresses the issue you faced with scalar columns and fixes other conversion issues. Hopefully this release resolves all the problems you hit during conversion. Please let me know whether you can successfully convert your dataset to v4.
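
For anyone following along, a quick way to check whether the installed release is at least 4.0.2 is sketched below. It uses only the standard library. `parse_version` and `has_fix` are hypothetical helper names, and the parsing assumes a plain `X.Y.Z` version string (pre-release suffixes are not handled).

```python
from importlib import metadata


def parse_version(text):
    """Parse a plain 'X.Y.Z' release string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed, minimum="4.0.2"):
    """True if `installed` is at least the release named in this thread."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = metadata.version("deeplake")
    print(installed, "includes the 4.0.2 fixes:", has_fix(installed))
except metadata.PackageNotFoundError:
    print("deeplake is not installed in this environment")
except ValueError:
    print("installed version is not a plain X.Y.Z string; check it manually")
```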
