module: unloading 'julia/1.5.0'
module: loading 'julia/1.5.0'
Activating environment at `/gpfs/scratch/yh31/projects/gigaword_64k/Project.toml`
ERROR: LoadError: On worker 2:
TaskFailedException:
peer 7 didn't connect to 2 within 59.99998998641968 seconds
error at ./error.jl:33
wait_for_conn at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:194
check_worker_state at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:168
send_msg_ at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:176
send_msg at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:134 [inlined]
#remotecall_fetch#143 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:389 [inlined]
remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386
#remotecall_fetch#146 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421 [inlined]
#52 at /users/yh31/.julia/packages/MemPool/zSuoT/src/datastore.jl:333 [inlined]
forwardkeyerror at /users/yh31/.julia/packages/MemPool/zSuoT/src/datastore.jl:236
poolget at /users/yh31/.julia/packages/MemPool/zSuoT/src/datastore.jl:332
move at /users/yh31/.julia/packages/Dagger/k7zru/src/chunks.jl:88
move at /users/yh31/.julia/packages/Dagger/k7zru/src/chunks.jl:86 [inlined]
move at /users/yh31/.julia/packages/Dagger/k7zru/src/chunks.jl:92
macro expansion at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:510 [inlined]
#71 at ./task.jl:356
wait at ./task.jl:267 [inlined]
fetch at ./task.jl:282 [inlined]
_broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
_broadcast_getindex at ./broadcast.jl:621 [inlined]
getindex at ./broadcast.jl:575 [inlined]
copy at ./broadcast.jl:876
materialize at ./broadcast.jl:837 [inlined]
do_task at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:507
#106 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294 [inlined]
#105 at ./task.jl:356
remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:394
remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386
remotecall_fetch(::Function, ::Int64, ::Int64, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421 [inlined]
macro expansion at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:541 [inlined]
(::Dagger.Sch.var"#75#76"{Dagger.OSProc,Int64,FileTrees.var"#87#90",Tuple{Dagger.Chunk{JDF.JDFFile{String},MemPool.DRef,Dagger.ThreadProc},Dagger.Chunk{JDF.JDFFile{String},MemPool.DRef,Dagger.ThreadProc}},Channel{Any},Bool,Bool,Bool,Dagger.Sch.ThunkOptions,Array{Int64,1},Dagger.NoOpLog,Dagger.Sch.SchedulerHandle})() at ./task.jl:356
Stacktrace:
[1] compute_dag(::Dagger.Context, ::Dagger.Thunk; options::Nothing) at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:208
[2] compute(::Dagger.Context, ::Dagger.Thunk; options::Nothing) at /users/yh31/.julia/packages/Dagger/k7zru/src/compute.jl:31
[3] compute at /users/yh31/.julia/packages/Dagger/k7zru/src/compute.jl:28 [inlined]
[4] exec(::Dagger.Context, ::Dagger.Thunk) at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/parallelism.jl:75
[5] exec(::Dagger.Thunk) at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/parallelism.jl:64
[6] save(::gigaword_64k.var"#22#34", ::FileTrees.FileTree; lazy::Nothing, exec::Bool) at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/values.jl:128
[7] save at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/values.jl:111 [inlined]
[8] process_part_of_tree(::String, ::String, ::Int64) at /gpfs/scratch/yh31/projects/gigaword_64k/src/gigaword_64k.jl:138
[9] top-level scope at /users/yh31/scratch/projects/gigaword_64k/src/afp_mapper.jl:29
[10] include(::Function, ::Module, ::String) at ./Base.jl:380
[11] include(::Module, ::String) at ./Base.jl:368
[12] exec_options(::Base.JLOptions) at ./client.jl:296
[13] _start() at ./client.jl:506
in expression starting at /users/yh31/scratch/projects/gigaword_64k/src/afp_mapper.jl:29
┌ Warning: Forcibly interrupting busy workers
│ exception = rmprocs: pids [6] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1234
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1030
Workers not connecting does not sound like a FileTrees issue.
Have you tried running something on the workers without using FileTrees or Dagger, for example pmap(i -> (sleep(1); myid()), 1:20), just to see that the workers are really set up correctly?
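A minimal version of that sanity check, assuming the workers have already been added with addprocs (or via a cluster manager such as the one used on the Slurm cluster here):

```julia
using Distributed

# Local workers for illustration; on a cluster a ClusterManager
# would be used instead of plain addprocs.
addprocs(4)

# Each task sleeps briefly and reports the id of the worker it ran on.
# If the workers are wired up correctly, the result should cover every
# id in workers(), not just one or two.
ids = pmap(i -> (sleep(1); myid()), 1:20)
println(sort(unique(ids)))
```

This exercises only Distributed itself, so if it hangs or errors, the problem is in the worker setup (or the cluster network) rather than in FileTrees or Dagger.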
I got the following error while trying to run FileTrees with multiple workers on an embarrassingly parallel problem. The code I was running is https://github.com/ym-han/gigaword_64k/blob/main/src/gigaword_64k.jl and https://github.com/ym-han/gigaword_64k/blob/main/src/afp_mapper.jl — it basically just filters a tree, loads it, processes it, then saves it.
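For context, that filter/load/process/save pattern looks roughly like the sketch below; read_file, process, and write_file are hypothetical stand-ins for the actual JDF-based I/O in gigaword_64k.jl:

```julia
using Distributed, FileTrees

t = FileTree("data")                          # lazy view of a directory
t = filter(f -> endswith(name(f), ".csv"), t) # keep only the files of interest

# Loading with lazy = true schedules the per-file work on the
# workers through Dagger instead of running it eagerly.
loaded = FileTrees.load(f -> read_file(path(f)), t; lazy = true)
processed = mapvalues(process, loaded)

# save() forces the lazy computation and writes each node's value out;
# get(f) retrieves the value stored at a file node.
FileTrees.save(f -> write_file(path(f), get(f)), processed)
```

With lazy = true, it is only at the save step that Dagger starts moving chunks between workers, which is consistent with the failure above surfacing inside poolget/move during save.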
I did not experience any issues when running it on just one node (i.e., without the parallelism). One thing I haven't tried yet is running this with even fewer workers (I had 5-6), since the amount of data I was processing wasn't actually that large; maybe that could have been the problem.
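Since the failure is a 60-second peer-connect timeout ("peer 7 didn't connect to 2 within 59.99... seconds"), one other knob that may be worth trying: Distributed reads the JULIA_WORKER_TIMEOUT environment variable to decide how long to wait for worker connections, so raising it can help rule out slow all-to-all connection setup on a busy cluster. A sketch, assuming the workers are launched from the master process:

```julia
using Distributed

# Must be set before the workers are launched so they inherit it;
# the default is 60 seconds, matching the timeout in the error above.
ENV["JULIA_WORKER_TIMEOUT"] = "300"

addprocs(4)   # or the appropriate ClusterManager on the cluster
```

If a longer timeout makes the error go away, the underlying issue is likely the workers being slow to establish their all-to-all connections rather than anything in FileTrees.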