Publishing Indexes to Main
In order to avoid duplication, we currently build indexes (e.g. with Data Managers) on Test and then, once tested, "publish" them to Main. However, since the introduction of CVMFS, we are duplicating the data anyway.
New DMs can be installed using the Test Installer. If the DM needs more memory than the default allocation (8 GB), be sure to modify `job_conf.xml` in the playbook; see the entries for existing DMs (the datamanager server should be the handler). Test will need to be restarted after changes are made (`ansible-env test config` will do this for you).
See issue #31 for important details about paths that need to be fixed for newly installed DMs.
Once indexes are built and are ready to publish (after the procedure below), you will need to update `/cvmfs/data.galaxyproject.org/managed/bin/managed.py` to understand the new index. Necessary changes include adding the new index to:
- the `locmap` dict. The value is the path to the DM's loc file, relative to `/galaxy-repl`. If the DM has multiple loc files, this can be a list.
- the `loccols` dict. The value is a dict (or list of dicts) whose keys correspond to the `build`, `dbkey`, and `path` data table columns. Not all are required (depending on what columns are present in the data table for that location file).
- the `index_subdirs` dict, if the index directory is a subdirectory of the genome build in `/galaxy-repl/manageddata/data`. The value is the name of the subdirectory in the genome directory.
- the `index_dirs` dict, if the index directory is a subdirectory of `/galaxy-repl/manageddata/data`.
Once complete, add the appropriate data table entry to `/cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml`. See existing entries as an example.
Go to https://test-datamanager.galaxyproject.org/ and use the installed DMs to create new indexes. New data may not be available to Test until it has been restarted. Once verified, it can be published in CVMFS with:
$ ssh [email protected]
$ cvmfs_server transaction data.galaxyproject.org
$ /cvmfs/data.galaxyproject.org/managed/bin/managed.py <index> # where <index> is a key in `locmap`
$ cvmfs_server publish -a <some_tag> -m <some_message> data.galaxyproject.org
NOTE: If the transaction is too large, your SSH session might time out before the publish command finishes, aborting the publish. To prevent this, one can use something like `tmux`, `screen`, or `nohup`.
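For example, one way to do this (just a sketch; use whichever tool you prefer) is to run the publish step inside a tmux session so it survives a dropped connection:
$ tmux new -s cvmfs-publish
$ cvmfs_server publish -a <some_tag> -m <some_message> data.galaxyproject.org
# detach with Ctrl-b d; reattach later with `tmux attach -t cvmfs-publish`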
A bit about what's going on: `managed.py` is essentially rsyncing data from Corral @ TACC into CVMFS, modifying paths and location files as it goes. We probably should not modify the data paths, but I chose to do so to ease the CVMFS catalog size splits. Where DM data is installed at:
/galaxy-repl/manageddata/data/<genome_build>/<index_name>/<build_id>/...
I copy this to CVMFS as:
/cvmfs/data.galaxyproject.org/managed/<index_name>/<build_id>/...
Location files are copied from their DM paths (`/galaxy-repl/test/tool_data/...`) to `/cvmfs/data.galaxyproject.org/managed/location`, and the paths within them are updated to the correct locations in CVMFS.
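As a rough illustration of the effect for a single build (not what the script literally runs; the rnastar paths follow the example later on this page, and the loc file name is an assumption), the copy and path rewrite amount to something like:
$ rsync -avP /galaxy-repl/manageddata/data/hg19/rnastar_index2/hg19/ /cvmfs/data.galaxyproject.org/managed/rnastar_index2/hg19/
$ sed 's|/galaxy-repl/manageddata/data/hg19/rnastar_index2|/cvmfs/data.galaxyproject.org/managed/rnastar_index2|' /galaxy-repl/test/tool_data/rnastar_index2.loc > /cvmfs/data.galaxyproject.org/managed/location/rnastar_index2.loc
The real script also handles multiple builds and the column layouts defined in `loccols`.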
You can force these changes to be rapidly distributed by logging in to the stratum 1 CVMFS servers as `g2test` and running `cvmfs_server snapshot data.galaxyproject.org && systemctl restart squid`, then wiping the cache on the Main servers (`galaxy-web-{01..04}`) with `/usr/local/bin/cvmfs_wipecache` (sorry, no automated process yet). If you don't do this, changes should still propagate over a few hours. Main may need to be restarted (gracefully) to pick up the new indexes on the web handlers (`galaxy-web-{05,06}`):
[g2main@galaxy-web-05 ~]$ galaxyctl graceful
galaxy_main_uwsgi:zergling0: started
[g2main@galaxy-web-06 ~]$ galaxyctl graceful
galaxy_main_uwsgi:zergling0: started
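If you want to script the whole manual propagation described above, a sketch (the stratum 1 hostname is a placeholder, and ssh access to each host as the listed users is assumed):
$ ssh g2test@<stratum1-host> 'cvmfs_server snapshot data.galaxyproject.org && systemctl restart squid'
$ for h in galaxy-web-{01..04}; do ssh g2main@$h /usr/local/bin/cvmfs_wipecache; done
$ for h in galaxy-web-{05,06}; do ssh g2main@$h galaxyctl graceful; done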
If you really want to watch the workers spin up and down while this is going on, you could use the command `galaxyctl graceful && watch $HOME/bin/supervisorctl status`.
Jetstream mounts `/cvmfs/data.galaxyproject.org`, so nothing should be necessary to make the data available on Jetstream.
With the Parrot connector, CVMFS is now available on Bridges, so the following section is for anyone curious.
Stampede and Bridges do not mount CVMFS, so it's necessary to copy the data. Stampede is currently behind and I need to change some things to update it, but it should be working fine for most data. At present, the only tool running on Bridges that needs reference data is (RNA) STAR, so I copied those by hand. However, due to the path changes we make for Main, the following information is relevant:
On Test, we use the data as installed by the DMs, but on Main we modify the structure to live under a per-index directory. For example, on Test:
hg19/rnastar_index2/hg19/...
On Main:
rnastar_index2/hg19/...
In order to make the data work for both, I copy from TACC using the Test layout and then create symlinks.
To copy data from TACC (any VM at TACC should work):
$ cd /galaxy-repl/manageddata/data
$ for d in $(find . -type d -name rnastar_index2); do rsync -avP --relative /galaxy-repl/manageddata/data/${d} [email protected]:/pylon2/mc48nsp/xcgalaxy/data; done
The use of `find .` is critical: when using `--relative` and a path contains `/./`, the path components before the `/./` are stripped from the path created on the remote side.
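To make this concrete (the destination host is a placeholder; the paths reuse the rnastar example above): since the loop runs from `/galaxy-repl/manageddata/data` and `find .` returns paths like `./hg19/rnastar_index2`, each source argument contains `/./`, so rsync recreates only the part after it:
$ rsync -avP --relative /galaxy-repl/manageddata/data/./hg19/rnastar_index2 xcgalaxy@<bridges-host>:/pylon2/mc48nsp/xcgalaxy/data
# creates hg19/rnastar_index2/... under /pylon2/mc48nsp/xcgalaxy/data rather than the full /galaxy-repl/... tree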
To create symlinks for Main (as `xcgalaxy` on Bridges):
$ cd /pylon2/mc48nsp/xcgalaxy/data
$ mkdir -p rnastar_index2
$ for b in */rnastar_index2/*; do ln -s ../$b rnastar_index2/$(basename $b); done
The need for copying can be eliminated. The changes to support this are outlined in #30.