
Add cluster + distributed fs documentation #29

Open · xiaodaigh opened this issue Aug 26, 2020 · 6 comments

xiaodaigh commented Aug 26, 2020

I see in the docs that a FileTree can work on a computer with multiple processes. But can it be used on a cluster? The file/folder structure doesn't seem to lend itself well to distributed processes.

shashi (Owner) commented Aug 27, 2020

You would need a distributed file system. I don't think I have plans to make it work without that, because it feels like an orthogonal abstraction to FileTrees.
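For what it's worth, the pattern is just the usual multi-process one, run on nodes that all mount the same storage at the same path. A minimal sketch, assuming such a shared mount (NFS, GPFS, Lustre, ...) and a hypothetical /shared/data directory:

using Distributed
addprocs(4)  # on a cluster: addprocs_slurm / addprocs_lsf from ClusterManagers

@everywhere using FileTrees

tree = FileTree("/shared/data")    # must resolve to the same files on every worker
strs = FileTrees.load(tree, lazy = true) do f
    read(string(path(f)), String)  # any per-file loading logic goes here
end
exec(strs)                         # lazy values are computed across the workers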

shashi closed this as completed Aug 27, 2020

shashi (Owner) commented Aug 27, 2020

That said, I should add this to the docs.

shashi changed the title from "Can a FileTree work on a cluster?" to "Add cluster + distributed fs documentation" Aug 27, 2020
shashi reopened this Aug 27, 2020
shashi (Owner) commented Aug 27, 2020

If anybody tries this, I'd love to get the concise list of commands used. If not, I will try it myself soon.

ym-han commented Aug 29, 2020

I plan to write a tutorial sometime in the next few weeks. But I thought it might be helpful to put the code I used out there first; there probably aren't any surprises, honestly:

SLURM SCRIPT

Basically what you'd expect; the following code is for an embarrassingly parallel problem; I didn't use any multithreading:

#!/bin/bash
#SBATCH --job-name="slurm_1987"
#SBATCH --output="slurm_1987.%j.%N.out"
#SBATCH --error="slurm_1987.%j.err"
#SBATCH -n 20                # 20 tasks, matching addprocs_slurm(20) below
#SBATCH --export=ALL
#SBATCH -c 1                 # 1 CPU per task (no multithreading)
#SBATCH -t 28:00:00
#SBATCH --mem-per-cpu=12G

module load julia/1.5.0
cd /users/yh31/scratch/projects/sr-nlp/data_gathering/source/jl_data_pipeline/src/prepping_entity_linker
julia --project=@. /users/yh31/scratch/projects/sr-nlp/data_gathering/source/jl_data_pipeline/src/prepping_entity_linker/main_slurm_87.jl

THE JULIA SCRIPT

try
    using ClusterManagers
catch
    import Pkg
    Pkg.add("ClusterManagers")
end

using ClusterManagers
addprocs_slurm(20, exeflags="--project=$(Base.active_project())")
# no need to add time or mem here

using Distributed

@everywhere begin
    import Pkg; Pkg.instantiate()
    # FileTrees, Underscores (@_), and JDF are assumed to be loaded by the
    # included file; if not, add `using FileTrees, Underscores, JDF` here.
    include("/users/yh31/scratch/projects/sr-nlp/data_gathering/source/jl_data_pipeline/src/prepping_entity_linker/more_jl_code.jl")
    const year = 1987
    const final_data_dir = "/users/yh31/scratch/datasets/entity_linking/final_data/prototype_aug_2020"
    const nyt_data_dir = "/gpfs/scratch/yh31/datasets/entity_linking/raw_data/nyt_corpus/data"
end

data_tree = FileTree(joinpath(nyt_data_dir, string(year)))  # joinpath needs a string, not an Int
linked_dfs = FileTrees.load(data_tree, lazy = true) do f
    @_ f |> string(FileTrees.path(__)) |>
        get_data_from_file(__) |>
        clean!(__) |>
        entity_link(__)
end

# Add corefs
coref_dfs = mapvalues(linked_dfs) do df
    @_ df |> coref_chains!(__)  |> 
        corefs_per_entity!(__)  |> 
        coref_chain_cleanup!(__)
end

out_tree = FileTrees.rename(coref_dfs, joinpath(final_data_dir, string(year)))

FileTrees.save(out_tree) do f
    savejdf(string(FileTrees.path(f)) * ".jdf", FileTrees.get(f))
end

Basically the only Slurm-related things I had to do were the sorts of things one would have to do for any distributed computing in Julia on Slurm (though if one were doing multithreading, one would need to add the one or two lines of code you've posted in the "ultimate guide to distributed computing" thread; see the sketch below).
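For completeness, a hedged sketch of that multithreaded variant; the flag comes from that Discourse thread, and the batch script would need #SBATCH -c 4 instead of -c 1:

using ClusterManagers
# Each Slurm worker starts with 4 Julia threads (assumes Julia >= 1.5 and a
# matching -c value in the batch script):
addprocs_slurm(20, exeflags = `--project=$(Base.active_project()) --threads=4`)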

ym-han commented Aug 29, 2020

Oh, and I ended up just using blank DataFrames instead of FileTrees.NoValue(), even after you had added support for that on the master branch; I can't remember now why I thought blank DataFrames were still better than NoValue.
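For anyone weighing the two, the choice would sit in the load callback, roughly as below; treating NoValue() as a valid return value is an assumption on my part, based on that master-branch change:

# Hedged sketch; get_data_from_file is the helper from the script above.
linked_dfs = FileTrees.load(data_tree, lazy = true) do f
    data = get_data_from_file(string(FileTrees.path(f)))
    if isempty(data)
        return DataFrame()            # blank df: downstream sees an empty table
        # return FileTrees.NoValue()  # unverified alternative on master
    end
    data
end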

DrChainsaw (Collaborator) commented

Hi,

I just tried out FileTrees with LSF and ClusterManagers and I got things to work pretty much out of the box.

There are a few gotchas though which maybe can be smoothed out (or maybe are impossible to fix, I dunno).

  1. From reading the Dagger documentation I get the impression that the scheduler is initialized with the workers it can use. Does that mean that one needs to wait for addprocs_xxx to initialize all workers before FileTrees can start processing, or else new workers will not be used? (There's a sketch of the conservative workaround after this list.) I guess I could run some experiments to verify whether this is the case, but I'm hoping @shashi knows the answer already :)

  2. Also somewhat unverified: if one worker crashes, it seems like the whole result is lost. Is it possible to obtain a partial result in this case? For simple cases (pure mappings and reductions to a single value) it seems like it should be possible to just resume operations from where things stopped working.
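A sketch of the conservative workaround for point 1, assuming the answer is "yes" (i.e. the scheduler only sees workers that are connected before the first FileTrees call); "data" is a hypothetical directory on the shared filesystem:

using Distributed, ClusterManagers

pids = addprocs_lsf(20)   # blocks until all 20 workers have connected
@assert nworkers() == 20
@everywhere using FileTrees

tree = FileTree("data")
results = exec(FileTrees.load(f -> read(string(path(f)), String), tree, lazy = true))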
