
Add cluster + distributed fs documentation #29

Open · xiaodaigh opened this issue Aug 26, 2020 · 6 comments

xiaodaigh commented Aug 26, 2020

I see in the docs that a FileTree can work on a computer with multiple processes. But can it be used on a cluster? The file/folder structure doesn't seem to lend itself well to distributed processes.

shashi (Owner) commented Aug 27, 2020

You would need a distributed file system. I don't think I have plans to make it work without that, because it feels like an orthogonal abstraction to FileTrees.
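For what it's worth, the pattern is just the usual multi-process one, run on nodes that all mount the same storage at the same path. A minimal sketch, assuming such a shared mount (NFS, GPFS, Lustre, ...) and a hypothetical /shared/data directory:

using Distributed
addprocs(4)  # on a cluster: addprocs_slurm / addprocs_lsf from ClusterManagers

@everywhere using FileTrees

tree = FileTree("/shared/data")    # must resolve to the same files on every worker
strs = FileTrees.load(tree, lazy = true) do f
    read(string(path(f)), String)  # any per-file loading logic goes here
end
exec(strs)                         # lazy values are computed across the workers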

shashi closed this as completed Aug 27, 2020

shashi (Owner) commented Aug 27, 2020

That said, I should add this to the docs.

shashi changed the title from "Can a FileTree work on a cluster?" to "Add cluster + distributed fs documentation" Aug 27, 2020
shashi reopened this Aug 27, 2020
shashi (Owner) commented Aug 27, 2020

If anybody tries this, I'd love to get the concise list of commands used. If not, I will try it myself soon.

ym-han commented Aug 29, 2020

I plan to write a tutorial sometime in the next few weeks. But I thought it might be helpful to put the code I used out there first; there probably aren't any surprises, honestly:

SLURM SCRIPT

Basically what you'd expect; the following code is for an embarrassingly parallel problem; I didn't use any multithreading:

#!/bin/bash
#SBATCH --job-name="slurm_1987"
#SBATCH --output="slurm_1987.%j.%N.out"
#SBATCH --error="slurm_1987.%j.err"
#SBATCH -n 20                # 20 tasks, matching addprocs_slurm(20) below
#SBATCH --export=ALL
#SBATCH -c 1                 # 1 CPU per task (no multithreading)
#SBATCH -t 28:00:00
#SBATCH --mem-per-cpu=12G

module load julia/1.5.0
cd /users/yh31/scratch/projects/sr-nlp/data_gathering/source/jl_data_pipeline/src/prepping_entity_linker
julia --project=@. /users/yh31/scratch/projects/sr-nlp/data_gathering/source/jl_data_pipeline/src/prepping_entity_linker/main_slurm_87.jl

THE JULIA SCRIPT

try
    using ClusterManagers
catch
    import Pkg
    Pkg.add("ClusterManagers")
end

using ClusterManagers
addprocs_slurm(20, exeflags="--project=$(Base.active_project())")
# no need to add time or mem here

using Distributed

@everywhere begin
    import Pkg; Pkg.instantiate()
    # FileTrees, Underscores (@_), and JDF are assumed to be loaded by the
    # included file; if not, add `using FileTrees, Underscores, JDF` here.
    include("/users/yh31/scratch/projects/sr-nlp/data_gathering/source/jl_data_pipeline/src/prepping_entity_linker/more_jl_code.jl")
    const year = 1987
    const final_data_dir = "/users/yh31/scratch/datasets/entity_linking/final_data/prototype_aug_2020"
    const nyt_data_dir = "/gpfs/scratch/yh31/datasets/entity_linking/raw_data/nyt_corpus/data"
end

data_tree = FileTree(joinpath(nyt_data_dir, string(year)))  # joinpath needs a string, not an Int
linked_dfs = FileTrees.load(data_tree, lazy = true) do f
    @_ f |> string(FileTrees.path(__)) |>
        get_data_from_file(__) |>
        clean!(__) |>
        entity_link(__)
end

# Add corefs
coref_dfs = mapvalues(linked_dfs) do df
    @_ df |> coref_chains!(__)  |> 
        corefs_per_entity!(__)  |> 
        coref_chain_cleanup!(__)
end

out_tree = FileTrees.rename(coref_dfs, joinpath(final_data_dir, string(year)))

FileTrees.save(out_tree) do f
    savejdf(string(FileTrees.path(f)) * ".jdf", FileTrees.get(f))
end

Basically the only Slurm-related things I had to do were the sorts of things one would have to do for any distributed computing in Julia on Slurm (though if one were doing multithreading, one would need to add the one or two lines of code you've posted in the "ultimate guide to distributed computing" thread; see the sketch below).
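For completeness, a hedged sketch of that multithreaded variant; the flag comes from that Discourse thread, and the batch script would need #SBATCH -c 4 instead of -c 1:

using ClusterManagers
# Each Slurm worker starts with 4 Julia threads (assumes Julia >= 1.5 and a
# matching -c value in the batch script):
addprocs_slurm(20, exeflags = `--project=$(Base.active_project()) --threads=4`)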

ym-han commented Aug 29, 2020

Oh, and I ended up just using blank DataFrames instead of FileTrees.NoValue(), even after you had added support for that on the master branch; I can't remember now why I thought blank DataFrames were still better than NoValue.
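For anyone weighing the two, the choice would sit in the load callback, roughly as below; treating NoValue() as a valid return value is an assumption on my part, based on that master-branch change:

# Hedged sketch; get_data_from_file is the helper from the script above.
linked_dfs = FileTrees.load(data_tree, lazy = true) do f
    data = get_data_from_file(string(FileTrees.path(f)))
    if isempty(data)
        return DataFrame()            # blank df: downstream sees an empty table
        # return FileTrees.NoValue()  # unverified alternative on master
    end
    data
end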

DrChainsaw (Collaborator) commented

Hi,

I just tried out FileTrees with LSF and ClusterManagers and I got things to work pretty much out of the box.

There are a few gotchas though which maybe can be smoothed out (or maybe are impossible to fix, I dunno).

  1. From reading the Dagger documentation I get the impression that the scheduler is initialized with the workers it can use. Does that mean that one needs to wait for addprocs_xxx to initialize all workers before FileTrees can start processing, or else new workers will not be used? (There's a sketch of the conservative workaround after this list.) I guess I could run some experiments to verify whether this is the case, but I'm hoping @shashi knows the answer already :)

  2. Also somewhat unverified: if one worker crashes, it seems like the whole result is lost. Is it possible to obtain a partial result in this case? For simple cases (pure mappings and reductions to a single value) it seems like it should be possible to just resume operations from where things stopped working.
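A sketch of the conservative workaround for point 1, assuming the answer is "yes" (i.e. the scheduler only sees workers that are connected before the first FileTrees call); "data" is a hypothetical directory on the shared filesystem:

using Distributed, ClusterManagers

pids = addprocs_lsf(20)   # blocks until all 20 workers have connected
@assert nworkers() == 20
@everywhere using FileTrees

tree = FileTree("data")
results = exec(FileTrees.load(f -> read(string(path(f)), String), tree, lazy = true))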
