Refactor the BIA retrieval tool #11

Open
2 of 7 tasks
kostrykin opened this issue Jul 18, 2024 · 12 comments · May be fixed by bgruening/galaxytools#1541
Assignees
Labels
BHEU24 Has been worked on at the BHEU2024 tools dev This issue involves tool development

Comments

@kostrykin
Collaborator

kostrykin commented Jul 18, 2024

There is a tool for downloading images from the Bioimage Archive:
https://imaging.usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/bgruening/bia_download/bia_download/0.1.0+galaxy0



The UI of this tool needs some love:

  • The "storage mode" can only be nfs or fire, this should thus be a dropdown field.
  • The help text of the field should say when to use which of the two options.
  • The "path of accession" field is unclear — what does a user need to put in here? Add a help text.
  • Why are both fields optional? Is that correct?
  • Is it possible to retrieve only part of a dataset?

Optional:

  • If it is possible to retrieve only part of a dataset, this should be implemented.
  • Migrate to the alpha version of the API? Stable API could be available by now (Beatriz should have further info).
@kostrykin kostrykin added this to BH2024 Jul 18, 2024
@kostrykin kostrykin converted this from a draft issue Jul 18, 2024
@kostrykin kostrykin added the tools dev This issue involves tool development label Jul 18, 2024
@B0r1sD
Collaborator

B0r1sD commented Nov 4, 2024

I'm looking into this task.

Tool source: tools/image_processing/bia-ftplinks

Is it helpful to add this tool to the IUC tool repository, so we also make use of their tests and best practices?

@B0r1sD
Collaborator

B0r1sD commented Nov 4, 2024

Started tracking the progress in this draft PR: bgruening/galaxytools#1541

  • The "storage mode" can only be nfs or fire, this should thus be a dropdown field.
  • The help text of the field should say when to use which of the two options.

but I have found info about it in the following places:

And some context:
FIRE stands for FIle REplication, EMBL-EBI’s very large-scale object data storage system. This provides long-term sustainable storage, operational redundancy, and backup to tape. Dataset level metadata are stored in a MongoDB database. The system backend is coded in Kotlin.

  • They are essential, added.
  • Why are both fields optional? Is that correct?

@kostrykin
Collaborator Author

Thanks @B0r1sD!

  • This point is not clear to me yet:
  • The "storage mode" can only be nfs or fire, this should thus be a dropdown field.

Right now "storage mode" is a text field, but according to the help text of the field, only two values are accepted (either nfs or fire). In that case, this field should be a dropdown field where one of the two options can be selected, since any other input value would be invalid, and a good UI should prevent invalid input.

  • The help text of the field should say when to use which of the two options.

To me and @beatrizserrano it wasn't immediately clear how to determine the correct value for this input. I see from your explanations what either of the two is, but still, how is the user supposed to determine the correct value for input here? Can we add a help text here to provide some guidance?

@B0r1sD
Collaborator

B0r1sD commented Nov 5, 2024

  • Mail sent to BIA to ask about nfs/fire, excerpt from the reply:

To answer your question, NFS is our older storage system and FIRE is the new one. So, we have some files still in NFS while the newer ones are on FIRE. There is not a recommended mode for FTP download, but you will need to use the correct one for each dataset. Sorry if this was unclear.
You can find out what is the storage mode for a dataset using this command:

curl -s "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" | jq -r .ftpLink

  • Other checklist points addressed and committed to the PR.
  • Looked into downloading subsets via wget. It should be possible, but I have to look into how to implement it. Presumably we would let the user input a list of subsets and then implement the download script from BIA in the wrapper.
    Example of what the downloaded script looks like:
#!/bin/bash
# Run this file in bash with this command:  ./filename
HOST=ftp.ebi.ac.uk
USER=anonymous
ftp -pinv $HOST <<EOF
user $USER
cd biostudies/fire/S-BIAD/458/S-BIAD1458/Files
binary
mget "Red blood cell differential image data/data/0-0.3/0(11).jpg"
mget "Red blood cell differential image data/data/0-0.3/0(2).jpg"
disconnect
bye
EOF
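
A minimal sketch of how the wrapper could generate such a script for a user-supplied list of files. The function name `make_ftp_script` and its arguments are hypothetical and not part of the existing tool; the generated lines simply mirror the BIA script above.

```shell
# Hypothetical helper: print an FTP batch script for a subset of files.
#   $1     remote directory, e.g. biostudies/fire/S-BIAD/458/S-BIAD1458/Files
#   $2...  file paths relative to that directory
make_ftp_script() {
    remote_dir=$1
    shift
    # Fixed header, copied from the script that BIA offers for download.
    printf '%s\n' '#!/bin/bash' 'HOST=ftp.ebi.ac.uk' 'USER=anonymous' \
        'ftp -pinv $HOST <<EOF' 'user $USER' "cd $remote_dir" 'binary'
    # One mget line per requested file; quotes keep spaces in names intact.
    for f in "$@"; do
        printf 'mget "%s"\n' "$f"
    done
    printf '%s\n' 'disconnect' 'bye' 'EOF'
}

# Usage: write the script to a file, then run it with bash.
# make_ftp_script "biostudies/fire/S-BIAD/458/S-BIAD1458/Files" \
#     "Red blood cell differential image data/data/0-0.3/0(2).jpg" > download.sh
```

Note that file names containing `$`, backticks or double quotes would need escaping before they are written into the here-document; the sketch does not handle that.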

@kostrykin
Collaborator Author

  • Mail sent to BIA to ask about nfs/fire, excerpt from the reply:

To answer your question, NFS is our older storage system and FIRE is the new one. So, we have some files still in NFS while the newer ones are on FIRE. There is not a recommended mode for FTP download, but you will need to use the correct one for each dataset. Sorry if this was unclear.
You can find out what is the storage mode for a dataset using this command:
curl -s "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" | jq -r .ftpLink

Cool, can we use the curl command in the tool wrapper to determine the correct mode automatically?

@B0r1sD
Collaborator

B0r1sD commented Nov 5, 2024

Yeah that would be ideal.

@kostrykin
Collaborator Author

Yeah that would be ideal.

Let me know if you need any help!

@kostrykin
Collaborator Author

@B0r1sD What's your current state? We need to report our state tomorrow. It would be ideal if you could tick the boxes! 🥳

@B0r1sD
Collaborator

B0r1sD commented Nov 7, 2024

Current state: we're in talks with folks from BIA to add a button on their website to seamlessly integrate a data retrieval method similar to some 'Get Data' tools (UCSC, EBI SRA,...).

In the meantime, I got an answer about FIRE/NFS:

right now all our studies have been migrated to FIRE storage. However, we are introducing a new feature that will use NFS as a storage option again. This will mean that soon we’ll have data on NFS and FIRE soon, so you probably want to keep that in mind.

So we decided to keep the dropdown, make FIRE the default option, and thoroughly explain why there are two options (and when to choose which).
In the future, when they start using NFS again, we can look into integrating the curl + jq command that checks whether a study is on FIRE or NFS. This command worked:

curl "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" -s | jq '.. | .ftpLink? // empty'

However, it makes use of their API, which is still in alpha.

Via the FTP link, the nfs/fire information is not directly included (it could be looked up later if we change the wrapper).

curl "https://ftp.ebi.ac.uk/biostudies/fire/S-BIAD/570/S-BIAD570/" -s
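
A rough sketch of how the wrapper might turn the `ftpLink` value into a storage mode, should automatic detection be wanted later. It assumes that the link returned by the API has the same `.../biostudies/<mode>/...` layout as the paths quoted in this thread, and that NFS studies use `nfs` as the path component; both are assumptions, and the function name `storage_mode_from_link` is hypothetical.

```shell
# Hypothetical helper: extract the storage mode from an FTP link.
# Assumes the link contains ".../biostudies/fire/..." or ".../biostudies/nfs/...".
storage_mode_from_link() {
    case $1 in
        */biostudies/fire/*) echo fire ;;
        */biostudies/nfs/*)  echo nfs ;;
        *)
            # Unknown layout: report on stderr and signal failure.
            echo "cannot determine storage mode from: $1" >&2
            return 1
            ;;
    esac
}

# Usage (needs network access and jq; the API is in alpha and may change):
# link=$(curl -s "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" | jq -r .ftpLink)
# storage_mode_from_link "$link"
```

Failing loudly on an unrecognised layout seems safer than silently falling back to FIRE, since the API response format is not guaranteed.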

@B0r1sD
Collaborator

B0r1sD commented Nov 7, 2024

Information communicated to the BIA:

Technical details

General flow

  • User starts at Galaxy, gets sent to external resource with 'GALAXY_URL' parameter.
  • User browses the external site and selects options. The site then sends the data to Galaxy via the GALAXY_URL parameter, together with a URL parameter that tells Galaxy where to inform the external site of the final GALAXY_URL
  • Galaxy contacts 'URL', with a new GALAXY_URL (the page content of accessing 'URL' should end with 'OK')
  • When data is ready, the external site contacts the new GALAXY_URL, providing 'URL' which contains final data and 'STATUS' which should be 'OK' (when successful)
  • Data is loaded into the Galaxy history.

Depending on how the data is fetched at your end, the depositing of data should be either implemented synchronously (docs) or asynchronously (docs). The synchronous implementation is less complex but depending on your backend, this could simply not be an option. @wm75 is an expert in this implementation so can provide technical support where needed.
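
To illustrate the step where the external site hands the user back to Galaxy, here is a small sketch that builds the return address from a received GALAXY_URL and the location of the data. It only covers appending the 'URL' parameter described above; the helper names `urlencode` and `galaxy_return_url` and the `tool_id=bia` value are made up for the example, and the percent-encoding handles ASCII input only.

```shell
# Hypothetical helper: percent-encode a string for use as a query value (ASCII only).
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [a-zA-Z0-9.~_-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# Hypothetical helper: append URL=<data location> to the GALAXY_URL we received.
galaxy_return_url() {
    galaxy_url=$1
    data_url=$2
    # GALAXY_URL may already carry a query string (assumption), so pick ? or &.
    case $galaxy_url in
        *\?*) sep='&' ;;
        *)    sep='?' ;;
    esac
    printf '%s%sURL=%s\n' "$galaxy_url" "$sep" "$(urlencode "$data_url")"
}

# Usage:
# galaxy_return_url "https://usegalaxy.eu/tool_runner?tool_id=bia" \
#     "https://ftp.ebi.ac.uk/biostudies/fire/S-BIAD/570/S-BIAD570/"
```

Whether further parameters are needed depends on the synchronous or asynchronous variant chosen, so the linked docs remain the reference.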

Code implementation example(s)

The following GitHub repo contains example scripts for the implementation in 3 different Python web frameworks (CherryPy, Django, Flask): https://github.com/hexylena/galaxy-data_source-examples
The lines of code Björn was referring to would look like this in CherryPy: https://github.com/hexylena/galaxy-data_source-examples/blob/main/cherrypy/server.py, which also comes with documentation: https://github.com/hexylena/galaxy-data_source-examples/tree/main/cherrypy#overview.

Examples

Below is an example of how this feature was implemented by the UCSC for their Tablebrowser, from both perspectives.

Data(base) side

Below, two examples of active implementations are shown, which is the relevant perspective for your team.

-UCSC Tableviewer-

      

     

-EBI SRA-

A video (from 2015) showing the workflow and how EBI implemented this on their side for the European Short Read Archive:

https://vimeo.com/121187220

https://usegalaxy.eu/tool_runner/data_source_redirect?tool_id=ebi_sra_main


Galaxy side

This is what the Galaxy tool (or 'wrapper') would look like on Galaxy's side. XML file example for the UCSC Tablebrowser: https://github.com/galaxyproject/galaxy/blob/dev/tools/data_source/ucsc_tablebrowser.xml. More technical information on this tool of the 'data source' type can be found here: https://docs.galaxyproject.org/en/latest/dev/data_source.html. This is something we would develop and provide.

@B0r1sD
Collaborator

B0r1sD commented Nov 7, 2024

The retrieval tool also only works for studies that are part of the 'BioImages - Core' collection (with an accession that looks like S-BIAD0000). This is not the only study collection on there, so I will document this in the wrapper for now and see how the seamless integration button progresses, as that would make this tool obsolete. For that reason I don't see the point right now in implementing an error catch or a feature that works with all types of studies (e.g. S-JCBD-201709074).

@B0r1sD
Collaborator

B0r1sD commented Nov 8, 2024

I'm having some issues serving the tool locally. The last change is a more verbose help section, which I will add later:

  • Storage mode
    FIle REplication (FIRE) is EMBL-EBI’s very large-scale object data storage system. At the time of writing, all their studies have been migrated to FIRE storage, which is why it is the default option. However, they are introducing a new feature that will use NFS as a storage option, so the study you are referring to might live on NFS in the near future. This is the reason both options are available.

  • Accession number:
    This tool only supports studies that are part of the 'BioImages - Core' collection (with an accession number that follows the S-BIAD0000 pattern).
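
Should a check of the accession pattern turn out to be useful after all, a sketch could look like the following. The directory layout (last three digits of the accession number as the sub-directory) is inferred only from the two paths quoted in this thread (S-BIAD570 and S-BIAD1458) and is an assumption, as is the behaviour for numbers with fewer than three digits; the function name `bia_ftp_dir` is hypothetical.

```shell
# Hypothetical helper: map an S-BIAD accession and storage mode to an FTP directory.
#   $1  accession, e.g. S-BIAD570
#   $2  storage mode, fire (default) or nfs
bia_ftp_dir() {
    accession=$1
    mode=${2:-fire}
    digits=${accession#S-BIAD}
    # Reject anything that is not "S-BIAD" followed by one or more digits.
    case $digits in
        "$accession"|''|*[!0-9]*)
            echo "unsupported accession: $accession" >&2
            return 1
            ;;
    esac
    case $mode in
        fire|nfs) ;;
        *) echo "unsupported storage mode: $mode" >&2; return 1 ;;
    esac
    # Assumption: the sub-directory is the last three digits of the number.
    sub=$digits
    while [ "${#sub}" -gt 3 ]; do
        sub=${sub#?}
    done
    printf 'biostudies/%s/S-BIAD/%s/%s\n' "$mode" "$sub" "$accession"
}

# Usage:
# bia_ftp_dir S-BIAD1458 fire
```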

@kostrykin kostrykin added the BHEU24 Has been worked on at the BHEU2024 label Nov 8, 2024
Projects
Status: In progress