Uncommunicative worker processes? #37

Closed
ym-han opened this issue Dec 20, 2020 · 2 comments

ym-han commented Dec 20, 2020

I got the following error while trying to run FileTrees with multiple workers on an embarrassingly parallel problem. The code I was running is at https://github.com/ym-han/gigaword_64k/blob/main/src/gigaword_64k.jl and https://github.com/ym-han/gigaword_64k/blob/main/src/afp_mapper.jl; it basically just filters a tree, loads it, processes it, and then saves the results.

I did not experience any issues when doing this with just one node (i.e., without the parallelism). One thing I haven't tried yet is running it with even fewer workers (I had 5-6), since the amount of data I was processing wasn't actually that large; maybe that could have been the problem.
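For reference, the pipeline has roughly this shape (a minimal sketch with a made-up directory, an invented glob, and a stand-in processing step; the real code is in the linked files):

```julia
using Distributed
addprocs(5)                        # I had 5-6 workers
@everywhere using FileTrees
using Glob

tree = FileTree("data")            # made-up input directory
txts = tree[glob"*.txt"]           # filter the tree down to the files of interest

# lazy=true turns each load/map into a Dagger task scheduled on the workers
loaded    = FileTrees.load(f -> read(string(path(f)), String), txts; lazy=true)
processed = mapvalues(uppercase, loaded)   # stand-in for the real processing

# save forces the lazy computation; this is the call in the stack trace below
FileTrees.save(processed) do f
    write(string(path(f)), f.value)        # f.value is the node's loaded value
end
```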

module: unloading 'julia/1.5.0'
module: loading 'julia/1.5.0'
 Activating environment at `/gpfs/scratch/yh31/projects/gigaword_64k/Project.toml`
ERROR: LoadError: On worker 2:
TaskFailedException:
peer 7 didn't connect to 2 within 59.99998998641968 seconds
error at ./error.jl:33
wait_for_conn at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:194
check_worker_state at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:168
send_msg_ at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:176
send_msg at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:134 [inlined]
#remotecall_fetch#143 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:389 [inlined]
remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386
#remotecall_fetch#146 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421 [inlined]
#52 at /users/yh31/.julia/packages/MemPool/zSuoT/src/datastore.jl:333 [inlined]
forwardkeyerror at /users/yh31/.julia/packages/MemPool/zSuoT/src/datastore.jl:236
poolget at /users/yh31/.julia/packages/MemPool/zSuoT/src/datastore.jl:332
move at /users/yh31/.julia/packages/Dagger/k7zru/src/chunks.jl:88
move at /users/yh31/.julia/packages/Dagger/k7zru/src/chunks.jl:86 [inlined]
move at /users/yh31/.julia/packages/Dagger/k7zru/src/chunks.jl:92
macro expansion at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:510 [inlined]
#71 at ./task.jl:356
wait at ./task.jl:267 [inlined]
fetch at ./task.jl:282 [inlined]
_broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
_broadcast_getindex at ./broadcast.jl:621 [inlined]
getindex at ./broadcast.jl:575 [inlined]
copy at ./broadcast.jl:876
materialize at ./broadcast.jl:837 [inlined]
do_task at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:507
#106 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294 [inlined]
#105 at ./task.jl:356
remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:394
remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386
remotecall_fetch(::Function, ::Int64, ::Int64, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421 [inlined]
macro expansion at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:541 [inlined]
(::Dagger.Sch.var"#75#76"{Dagger.OSProc,Int64,FileTrees.var"#87#90",Tuple{Dagger.Chunk{JDF.JDFFile{String},MemPool.DRef,Dagger.ThreadProc},Dagger.Chunk{JDF.JDFFile{String},MemPool.DRef,Dagger.ThreadProc}},Channel{Any},Bool,Bool,Bool,Dagger.Sch.ThunkOptions,Array{Int64,1},Dagger.NoOpLog,Dagger.Sch.SchedulerHandle})() at ./task.jl:356
Stacktrace:
 [1] compute_dag(::Dagger.Context, ::Dagger.Thunk; options::Nothing) at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:208
 [2] compute(::Dagger.Context, ::Dagger.Thunk; options::Nothing) at /users/yh31/.julia/packages/Dagger/k7zru/src/compute.jl:31
 [3] compute at /users/yh31/.julia/packages/Dagger/k7zru/src/compute.jl:28 [inlined]
 [4] exec(::Dagger.Context, ::Dagger.Thunk) at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/parallelism.jl:75
 [5] exec(::Dagger.Thunk) at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/parallelism.jl:64
 [6] save(::gigaword_64k.var"#22#34", ::FileTrees.FileTree; lazy::Nothing, exec::Bool) at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/values.jl:128
 [7] save at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/values.jl:111 [inlined]
 [8] process_part_of_tree(::String, ::String, ::Int64) at /gpfs/scratch/yh31/projects/gigaword_64k/src/gigaword_64k.jl:138
 [9] top-level scope at /users/yh31/scratch/projects/gigaword_64k/src/afp_mapper.jl:29
 [10] include(::Function, ::Module, ::String) at ./Base.jl:380
 [11] include(::Module, ::String) at ./Base.jl:368
 [12] exec_options(::Base.JLOptions) at ./client.jl:296
 [13] _start() at ./client.jl:506
in expression starting at /users/yh31/scratch/projects/gigaword_64k/src/afp_mapper.jl:29
┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [6] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1234
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1030
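One more data point: the ~60 seconds in the error matches Distributed's default worker connect timeout, which as far as I understand can be raised via the JULIA_WORKER_TIMEOUT environment variable. A sketch (this would only buy more time, not fix whatever is actually blocking the connection):

```julia
# Must be visible to every process, so set it before the workers are
# spawned (e.g. export it in the batch script); each process reads it
# when establishing connections.
ENV["JULIA_WORKER_TIMEOUT"] = "300"   # default is 60 seconds

using Distributed
addprocs(5)
```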
    
DrChainsaw (Collaborator) commented

Workers not connecting does not sound like a FileTrees issue.

Have you tried running something on the workers without using FileTrees or Dagger, for example pmap(i -> (sleep(1); myid()), 1:20), just to see that the workers are really set up correctly?
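Something along these lines (a sketch: it assumes workers added with a plain addprocs(5); swap in however you normally add workers on your cluster) exercises both master-worker and worker-worker communication, the latter being what the "peer 7 didn't connect to 2" message is about:

```julia
using Distributed
addprocs(5)

# Master <-> worker: each task sleeps, then reports which worker ran it.
@show pmap(i -> (sleep(1); myid()), 1:20)

# Worker <-> worker: from each worker, fetch myid() from another worker.
# If the all-to-all connections are broken, this should hit the same
# "peer X didn't connect to Y" timeout as in the log above.
@everywhere function ping_a_peer()
    others = setdiff(workers(), [myid()])
    isempty(others) && return myid() => myid()
    myid() => remotecall_fetch(myid, first(others))
end
@show [remotecall_fetch(ping_a_peer, w) for w in workers()]
```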

ym-han commented Feb 20, 2021

Thanks for pointing that out. I hadn't done that (I'm not very familiar with the Julia distributed computing stack). I'll close this for now.

ym-han closed this as completed Feb 20, 2021