Easily access datasets on Rucio data lake #156

Open
matbun opened this issue Jun 17, 2024 · 4 comments · May be fixed by #165

Comments

@matbun
Collaborator

matbun commented Jun 17, 2024

Add a Python function capable of translating a namespaced Rucio dataset/file to the absolute path on the local filesystem of the datacenter (e.g., HPC) on which the code is currently running.

Something like namespace_to_path('jdoe:physics_dataset') returning:

  • '/dacache/slling.si/.../physics_dataset' when on HPC1
  • '/other/path/.../physics_dataset' when on HPC2

The dataset may or may not be present on the HPC:

  • When the dataset is available on the local RSE, return a list of paths to the dataset files (or just the path to the root directory of the dataset, if you prefer).
  • When the dataset is not there, create a Rucio rule for async copy of the dataset and raise a custom exception to inform the user that the dataset is not present at the moment and that the job cannot continue, although a rule has been created.
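The two cases above could be sketched roughly as follows. This is only an illustration of the control flow, not the final implementation: `DatasetNotAvailableError`, `parse_did`, and the two injected callables `list_local_paths` and `create_rule` are hypothetical names, and the actual Rucio lookups are deliberately left behind those callables so the sketch runs without a Rucio installation.

```python
from typing import Callable, List, Tuple


class DatasetNotAvailableError(RuntimeError):
    """Hypothetical custom exception: dataset not on the local RSE yet."""

    def __init__(self, did: str, rse: str, rule_id: str) -> None:
        super().__init__(
            f"Dataset '{did}' is not available on RSE '{rse}' yet; "
            f"replication rule '{rule_id}' has been created."
        )
        self.did = did
        self.rse = rse
        self.rule_id = rule_id


def parse_did(did: str) -> Tuple[str, str]:
    """Split a namespaced identifier '<scope>:<name>' into its two parts."""
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"Expected '<scope>:<name>', got '{did}'")
    return scope, name


def namespace_to_path(
    did: str,
    local_rse: str,
    list_local_paths: Callable[[str, str, str], List[str]],
    create_rule: Callable[[str, str, str], str],
) -> List[str]:
    """Return local file paths for `did`, or request a copy and raise.

    `list_local_paths(scope, name, rse)` and `create_rule(scope, name, rse)`
    are assumed wrappers around the Rucio client; they are not Rucio APIs.
    """
    scope, name = parse_did(did)
    paths = list(list_local_paths(scope, name, local_rse))
    if paths:
        # Dataset is already on the local RSE: hand back its file paths.
        return paths
    # Dataset is missing: ask for an asynchronous copy, then stop the job.
    rule_id = create_rule(scope, name, local_rse)
    raise DatasetNotAvailableError(did, local_rse, rule_id)
```

Injecting the two callables also keeps the function easy to unit-test, since the tests can pass in fakes instead of talking to a real Rucio server.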

How to proceed:

  • Create a rucio.py module under src/itwinai/ to hold the Python function that converts a Rucio dataset to its absolute path on the local RSE
  • Add tests in a test_rucio.py file under tests/

Once this is done, we will integrate it with other itwinai modules (e.g., config parser and CLI)

@matbun
Collaborator Author

matbun commented Jun 17, 2024

@garciagenrique

@garciagenrique
Collaborator

Hello @matbun,

After speaking with a few people at CERN, there are two "main" ways to interact with Rucio data.

  1. Download the desired dataset into the localhost.
  2. Create a replication rule so that the files become available within the "local" RSE (Rucio Storage Element), i.e., the distributed storage that should exist at each of the data centers (and that should be mounted when you are logged in).

Option 1 takes much more time than option 2. Furthermore, you would need to keep an internet connection open during the whole download.

Therefore, we should go with option 2.
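As a rough sketch of what option 2 could look like from Python: the helper below asks for one replica of a dataset on a target RSE. It assumes the Rucio Python client exposes `add_replication_rule(dids, copies, rse_expression)` returning the created rule IDs, which should be checked against the Rucio docs for the installed version. The helper name `request_local_replica` is made up for illustration, and the client is passed in as an argument so the snippet does not need Rucio to be importable.

```python
from typing import Any, List


def request_local_replica(
    client: Any, scope: str, name: str, rse_expression: str
) -> List[str]:
    """Ask Rucio to keep one copy of `scope:name` on the given RSE.

    `client` is assumed to behave like `rucio.client.Client`; only its
    `add_replication_rule` method is used here.
    """
    dids = [{"scope": scope, "name": name}]
    rule_ids = client.add_replication_rule(
        dids=dids,
        copies=1,  # a single replica on the target RSE is enough
        rse_expression=rse_expression,
    )
    # The transfer itself is asynchronous; only the rule IDs come back now.
    return list(rule_ids)
```

With a configured Rucio environment, `client` would be created with `from rucio.client import Client; client = Client()`, and `rse_expression` would be the name of the local RSE at the data center.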

I can already create a small bash script for VEGA that symlinks all the dataset files into a txt file, which we would then need to adapt for each of the data centers. Step by step ;-).

Let me know where I can add this script within itwinai.

@matbun matbun changed the title Rucio integration Easily access dataset on Rucio data lake Jun 20, 2024
@matbun matbun changed the title Easily access dataset on Rucio data lake Easily access datasets on Rucio data lake Jun 20, 2024
@matbun
Collaborator Author

matbun commented Jun 20, 2024

I have created a new tutorial folder on a new branch: https://github.com/interTwin-eu/itwinai/tree/156-easily-access-datasets-on-rucio-data-lake/tutorials/data-lake/pull-dataset

@garciagenrique could you please add an example of "option 2" with some documentation? The goal is to give this example to the interTwin use cases, so that they can reproduce it for their own datasets. A couple of links to the Rucio docs would help as well.

Thanks!

@matbun
Collaborator Author

matbun commented Dec 10, 2024

Hey @garciagenrique, I have updated the issue description with what we discussed yesterday and with some suggestions on where to create the Python module and tests.
