Suggestions to improve created APIs #53

ml-evs · 2024-03-31T12:26:01Z

I've just been trying to make use of my own dataset through the API here, and have collected a few comments:

improve default rule for generating id field. At the moment we use precisely the relative filepath that the structure came from. I wonder if a better default would be attempting filepath.split("/")[-1].split(".")[0] (i.e., take just the filename). This is not guaranteed to be unique; rather than having some awkward config for specifying this, we could attempt to generate the unique set of IDs from first the filenames themselves, then the first folder up, then second folder up etc. and only fallback to the full file path when required. The immutable_id field (which we dont use atm) can be set by the relative filepath still. I will try experimenting with this...
we should set last_modified to be the datetime of extraction
the database metadata should include the optimade-maker version used

The text was updated successfully, but these errors were encountered:

eimrek · 2024-04-19T16:14:24Z

Regarding first point, i think there is value in having a single predictable ID that reflects the location in the archive file. However, sure, if that's in immutable_id, the we could in principle adopt a cleaner ID scheme.

If i look at the electrides dataset in the client (https://optimadeclient.materialscloud.io/) , on one hand i agree that it would look better if the structures.tar.gz/structures/mp_gga wasn't there for all the entries. But on the other hand, I think there is some value in that (to quickly find the source file, and to see that the calculations were done with gga.)

I could think of other cases where the folder stucture is good to keep as well. Let's say there are structures in two subfolders of a hypothetical entry:

structures.zip/pbe/*.cif
structures.zip/lda/*.cif

What if the two cif file lists are non-overlapping (bit of an edge case, i agree)? then the ids would only contain the filename, but i think the full path would be preferable.

Or some machine-learning dataset:

ml.zip/training_set/*.cif
ml.zip/test_set/*.cif

Probably good to keep the folder structure here as well.

ml-evs · 2024-04-19T16:35:45Z

Regarding first point, i think there is value in having a single predictable ID that reflects the location in the archive file. However, sure, if that's in immutable_id, the we could in principle adopt a cleaner ID scheme.

This is what I've done in #55 for now; we can still discuss whether it is worth merging though. The other motivation here is that right now the property files also have to include the full file path, which ends up looking pretty weird (and its backwards, you have to know how the archive will be laid out once recording the properties...)

If i look at the electrides dataset in the client (https://optimadeclient.materialscloud.io/) , on one hand i agree that it would look better if the structures.tar.gz/structures/mp_gga wasn't there for all the entries. But on the other hand, I think there is some value in that (to quickly find the source file, and to see that the calculations were done with gga.)

I could think of other cases where the folder stucture is good to keep as well. Let's say there are structures in two subfolders of a hypothetical entry:
structures.zip/pbe/*.cif
structures.zip/lda/*.cif
What if the two cif file lists are non-overlapping (bit of an edge case, i agree)? then the ids would only contain the filename, but i think the full path would be preferable.

Or some machine-learning dataset:
ml.zip/training_set/*.cif
ml.zip/test_set/*.cif
Probably good to keep the folder structure here as well.

Yep, agreed. My suggested implementation in #55 would generate IDs as training_set/*.cif and test_set/*.cif (drops the ml.zip) for the latter example (and would keep pbe and lda for the former). I agree that finding the file is useful. I am somewhat pre-empting changes in OPTIMADE v1.2 by being able to link directly to the files endpoint instead.

eimrek · 2024-04-23T14:12:34Z

Thanks @ml-evs, I think these changes are good. Fully agree that it's not nice if the property files need to include the full path including the name of the archive/compressed file. Feel free to finalize this implementation and notify me when i could give it a test run.

ml-evs mentioned this issue Mar 31, 2024

Improve ID generation scheme #55

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggestions to improve created APIs #53

Suggestions to improve created APIs #53

ml-evs commented Mar 31, 2024 •

edited

Loading

eimrek commented Apr 19, 2024

ml-evs commented Apr 19, 2024

eimrek commented Apr 23, 2024

Suggestions to improve created APIs #53

Suggestions to improve created APIs #53

Comments

ml-evs commented Mar 31, 2024 • edited Loading

eimrek commented Apr 19, 2024

ml-evs commented Apr 19, 2024

eimrek commented Apr 23, 2024

ml-evs commented Mar 31, 2024 •

edited

Loading