
Fix issue with reading 32 bit blobs #995

Merged: 22 commits, Feb 18, 2022

Conversation

@A-Baji A-Baji (Collaborator) commented Feb 15, 2022

No description provided.

@A-Baji A-Baji linked an issue Feb 15, 2022 that may be closed by this pull request
dtype = 'uint32' if self.is_32_bit else 'uint64'
try:
    data = np.frombuffer(self._blob, dtype=dtype, count=count, offset=self._pos)
except ValueError:
Member
This may not be a reliable mechanism for determining the decoding. The serialization should be explicit about how the encoding is performed.

Collaborator
@dimitri-yatsenko Yes, we discussed that, but the issue is that including the precision in the serialization scheme (e.g. dj064 or mym32) is an update that isn't backward compatible. It would work going forward but wouldn't help us with prior blobs, which is precisely where the issue is likely to arise.

Member
The Python implementation is deterministic. The serialization does not depend on whether the platform is 32- or 64-bit.

The problem is that the above solution will not catch all errors: some cases will not raise the ValueError. We should devise a solution that catches those cases as well.
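The silent-failure case is easy to reproduce with a hypothetical byte layout (the values below are illustrative, not taken from a real blob): when a 32-bit writer's dimension words happen to fill a whole number of 64-bit words, `np.frombuffer` succeeds with the wrong dtype and the `ValueError` fallback never fires:

```python
import numpy as np

# Hypothetical dimension header: a 32-bit writer stores dims (3, 1)
# as two little-endian uint32 words -- 8 bytes total.
buf = np.array([3, 1], dtype="<u4").tobytes()

# A 64-bit reader asking for one uint64 finds exactly 8 bytes, so no
# ValueError is raised -- it silently decodes a garbage dimension.
wrong = np.frombuffer(buf, dtype="<u8", count=1)[0]  # 3 + 2**32, not 3

# The intended 32-bit read recovers the real dimensions.
right = np.frombuffer(buf, dtype="<u4", count=2)     # array([3, 1])
```

Only byte counts that don't divide evenly by the element size raise `ValueError`, so the try/except in the diff above can only catch a subset of mismatches.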

Member
The real problem is that the mYm implementation depended on the platform. We should explicitly catch this case.

@guzman-raphael guzman-raphael (Collaborator) commented Feb 15, 2022
We are trying to find an approach with a fallback mechanism that can capture the vast majority of cases without introducing performance overhead for normal unpacking. This approach was chosen because one of our customers has an enormous amount of data (~100M records) that would otherwise need to be migrated.

Member

As far as I understand, for all numeric arrays the datatype is fully determined. mYm underspecified the integer length, so it fell back to the platform default. That was a problem with the MATLAB implementation only, correct? If so, we can apply a more targeted fix.

@dimitri-yatsenko (Member) commented

We should not rely on the ValueError to determine the encoding of array dimensions. This issue is limited to the specific case of old blobs created by a version of mYm that DataJoint itself never used. Therefore, please make this flag explicit rather than automatic. When working with these outdated blobs, the user will need to set

dj.blob.use_32bit_dims = True

@@ -43,6 +43,7 @@
}

bypass_serialization = False # runtime setting to bypass blob (en|de)code
use_32bit_dims = False # runtime setting to read data as 32-bit
@dimitri-yatsenko dimitri-yatsenko (Member) commented Feb 17, 2022
Suggested change
use_32bit_dims = False # runtime setting to read data as 32-bit
# runtime setting to read integers as 32-bit to read blobs created by the 32-bit
# version of the mYm library for MATLAB
use_32bit_dims = False

@A-Baji A-Baji (Collaborator, Author) commented Feb 17, 2022

Is there a better way of wording the change log that you would prefer? @dimitri-yatsenko

CHANGELOG.md Outdated
@@ -1,7 +1,7 @@
## Release notes

### 0.13.4 -- TBA
* Bugfix - Fix error when fetching data that was inserted as 32-bit
* Bugfix - Added a module flag for reading 32-bit data
@dimitri-yatsenko dimitri-yatsenko (Member) commented Feb 17, 2022

Suggested change
* Bugfix - Added a module flag for reading 32-bit data
* Bugfix - Allow reading blobs produced by legacy 32-bit compiled mYm library for matlab. PR #995

@@ -1,6 +1,6 @@
0.13.4 -- TBA
----------------------
* Bugfix - Fix error when fetching data that was inserted as 32-bit
* Bugfix - Added a module flag for reading 32-bit data
Member

Suggested change
* Bugfix - Added a module flag for reading 32-bit data
* Bugfix - Allow reading blobs produced by legacy 32-bit compiled mYm library for matlab. PR #995

@guzman-raphael guzman-raphael (Collaborator) left a comment

@A-Baji 👏 Fantastic job on a tricky one! 👏

@dimitri-yatsenko dimitri-yatsenko merged commit 5364981 into datajoint:master Feb 18, 2022
Successfully merging this pull request may close these issues:

* Error when fetching 32-bit mym-serialized blob data

3 participants