asyncmap: Include original backtrace in rethrown exception #32749
Conversation
Ah
Maybe what we should really do is
The suggested change does aid in debugging. It would, however, be a breaking change, since any existing error-checking code would need to check for
The larger issue here is how to deal with stack traces belonging to different asynchronously executed tasks. We don't have a clean approach for this yet. The stack traces you are seeing captured are from where some of the Distributed APIs internally use asyncmap. This is limited to the stack trace in the task calling
Well, this seems to be a much more complex problem than I originally thought and I don't really see a simple and obvious solution. Just some thoughts:

Maybe I am being over-optimistic, but it seems trivial to capture the backtraces and put them in

The problem is the end user will be given a complex tree of exceptions which may change in type, depth and width depending on the number of exceptions. It would probably be necessary to write some helper functions so that they could deal with it and get the original exceptions easily. Of course this would be quite a large breaking change as well.

Alternatively, I think it would be possible to make this wrapping of exceptions internal by unwrapping them before rethrowing the exception, concatenating the backtraces, then passing them to a new version of

Finally, I suppose it would also be possible to add a backtrace(s) field to every existing exception structure in the std/base library and
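For concreteness, a minimal sketch of capturing a backtrace at the catch site and carrying it along with the exception (this is only an illustration of the idea; `capture_and_rethrow` is a made-up name, not this PR's actual plumbing):

```julia
# Illustrative only: catch an exception where it is thrown, capture its
# backtrace, and rethrow a wrapper that carries both.
function capture_and_rethrow(f, args...)
    try
        f(args...)
    catch e
        # catch_backtrace() returns the backtrace of the exception currently
        # being handled; CapturedException bundles it with the exception.
        rethrow(CapturedException(e, catch_backtrace()))
    end
end
```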
For user code, a valid solution is for the mapping function to ensure that it never throws an exception but instead returns a CompositeException as an object in case of any errors, i.e., the mapping function wraps its logic in an overall try-catch block.
So, if you know the type of result you are expecting (say
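As a rough illustration of that wrap-errors-in-the-mapping-function pattern (the call and names below are only an example, not part of this PR):

```julia
# The mapping function never throws: errors come back as values, with their
# original backtraces, and can be inspected after asyncmap returns.
results = asyncmap(1:4) do i
    try
        i == 3 ? error("boom") : i^2
    catch e
        CapturedException(e, catch_backtrace())
    end
end

failures = filter(r -> r isa CapturedException, results)
isempty(failures) || @warn "some items failed" failures
```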
One option here might be to remove the

The benefit of that approach is that you preserve the entire exception stack for the failed task, not just the most recently caught exception.
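For illustration, inspecting a failed task's whole exception stack can look roughly like this on a recent Julia (`TaskFailedException` exists from 1.3, `current_exceptions` from 1.7; this is a sketch of the idea, not this PR's code):

```julia
t = @async error("inner failure")
try
    fetch(t)                       # rethrows as TaskFailedException if `t` failed
catch e
    e isa TaskFailedException || rethrow()
    # The failed Task still carries its full exception stack, not just the
    # most recently caught exception.
    for (exc, bt) in current_exceptions(e.task)
        showerror(stdout, exc, bt)
        println()
    end
end
```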
OK, I will make another attempt at this.
I think this PR goes in a good direction because it starts to unify the many ways Julia can distribute work, at least a little bit ;-)
A high-level question: what are the desired cancellation semantics for async work when a subset of the tasks fail? From a quick scan of the code it seems the current implementation is fail-fast, as a single failure will bring down a whole worker task, which might be working on a batch of more than one piece of work.
base/asyncmap.jl
end
end
end
exs != nothing && throw(CompositeException(exs))
This loop is functionally almost exactly the same as a `@sync` block with the new macro `@sync_add` from #32814, but `@sync` calls `wait` rather than `fetch`. Would the following work?

```julia
@sync begin
    foreach(t->@sync_add(t), worker_tasks)
end
```

(or just `sync_end(worker_tasks)`, though that couples this to the implementation detail in task.jl)
Good question; I suppose fail-fast, because the user can wrap their loop body in

In my use case I want it to be robust, but I just did something similar to what @amitmurthy posted. For a large 'mapreduce'-style calculation the user probably wants it to fail fast instead of waiting until the reduce stage to find out there was an error in one of the map batches.
FYI: seems like there is a race condition; not sure if I introduced it yet. The below hangs on my system.
Only happens with
The code looks quite neat here now.
TBH I'm not sure whether we should emit `CompositeException` in the fail-fast case; cancellation seems to be a rich topic in its own right and we've not really scratched the surface in designing such a system for Julia.
(Kotlin, Go and Python's Trio all have interesting ideas for structured cancellation. Some interesting reading: https://vorpus.org/blog/timeouts-and-cancellation-for-humans/ and https://kotlinlang.org/docs/reference/coroutines/exception-handling.html)
end
end
2-element Array{Any,1}:
 CapturedException(ErrorException("foo"), Any[(error(::String) at error.jl:33, 1), ...])
Great to see these docs 👍
I suppose this is not really fail-fast at the moment, at least relative to a system with (more) cancellation points:

```julia
julia> asyncmap([3, 2, 1]) do i
           sleep(0.01 * i)
           error("$i")
       end
ERROR: TaskFailedException:
3
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] (::var"##9#10")(::Int64) at ./REPL[2]:3
 [3] (::Base.var"##726#731"{var"##9#10"})(::Base.RefValue{Any}, ::Tuple{Int64}) at ./asyncmap.jl:127
 [4] macro expansion at ./asyncmap.jl:260 [inlined]
 [5] (::Base.var"##742#743"{Base.var"##726#731"{var"##9#10"},Channel{Any},Nothing})() at ./task.jl:333
...and 2 more exception(s).

Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:300
 [2] macro expansion at ./task.jl:319 [inlined]
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Array{Int64,1}) at ./asyncmap.jl:205
 [4] #async_usemap#721 at ./asyncmap.jl:181 [inlined]
 [5] #async_usemap at ./none:0 [inlined]
 [6] #asyncmap#720 at ./asyncmap.jl:108 [inlined]
 [7] asyncmap(::Function, ::Array{Int64,1}) at ./asyncmap.jl:108
 [8] top-level scope at REPL[2]:1
```
Yes, this looks good. If `sleep` were a cancellation point and the tasks were all in the same 'nursery', then a cancellation event could be raised for all tasks. It seems like the task would have to track anything which

Anyway, for this PR, I am wondering whether to simply modify the documentation to state that it will be cancelled at the end of each loop (assuming I am not missing anything)? If cancellation points are introduced, this would be a huge breaking change, so removing the composite exception would not matter too much, relatively speaking.
Yes, I think that's a fair comment; for the moment it seems good to be explicit about never losing exceptions. However, it's also true that there's a particular exception which causes the cancellation (the first one which calls

Personally I think it would be reasonable to go ahead with what you have here (with updated docs, and perhaps a guarantee about the ordering inside the

Some extra thoughts, a little off topic: if we do introduce a whole system for cancellation and structured concurrency I'm not certain

It's also interesting that Kotlin has chosen not to use an equivalent of
I will try it.
Yes, this sounds reasonable. I was thinking this might be best done as an external library for initial prototyping.
If a function fails for a truly unexpected reason (i.e. a real exception, not a timeout from a remote resource or a documented error condition) then a bunch of implementation details should come spilling out. You don't want anything to be hidden, especially as the error may not be easily reproducible. Plus I think it would be a bit odd if Julia started trying to hide implementation details, because it currently doesn't appear to hide anything :-p
Maybe (I have sometimes wanted this), or it could be that exceptions are being abused for expected errors and a
Right. In 1.3 the runtime pieces are in place to start experimenting more seriously with the API. I know other people are interested; for example @tkf has done some prototyping at https://github.com/tkf/Awaits.jl and @vchuravy has #31086. I'd be interested to know about other efforts.
This seems like a strict improvement, but we do need to think more carefully about structured concurrency (single-process and multi-process interactions)
```diff
@@ -595,7 +595,7 @@ let error_thrown = false
     try
         pmap(x -> x == 50 ? error("foobar") : x, 1:100)
     catch e
-        @test e.captured.ex.msg == "foobar"
+        @test e.exceptions[1].task.exception.captured.ex.msg == "foobar"
```
This is accessing fairly deep into the implementation details and just recently tripped up DistributedArrays.jl: JuliaParallel/DistributedArrays.jl#212

So either the test should be a series of `isa` checks documenting the expected object chain, or you use something like unpack
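For illustration, the `isa`-based version might look roughly like this (the nesting simply mirrors the chain used in the test above and is not a stable API; assumes a `using Test, Distributed` test context):

```julia
@test e isa CompositeException
te = e.exceptions[1]
@test te isa TaskFailedException
re = te.task.exception           # the exception recorded on the failed task
@test re isa RemoteException
@test re.captured isa CapturedException
@test re.captured.ex isa ErrorException
@test re.captured.ex.msg == "foobar"
```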
Sorry, I can't figure out how to do this without creating a mess. Also, it only solves the problem for a specific case, when this seems to be an issue for all asynchronous code. I am wondering if the

I suppose another way would be to have an extra error or output channel which could collect any exceptions in order.

FYI I will be away from my computer for a few days.
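A tiny sketch of the error-channel idea mentioned above (purely illustrative; the channel name and sizes are arbitrary):

```julia
# Each worker task catches its own error and put!s it on a shared channel, so
# exceptions are collected in the order in which they actually occurred.
errs = Channel{Any}(Inf)
@sync for i in 1:4
    @async try
        i == 2 && error("task $i failed")
    catch e
        put!(errs, CapturedException(e, catch_backtrace()))
    end
end
close(errs)
collected = collect(errs)   # exceptions, oldest first
```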
Right, that's what I was thinking; I misunderstood the other uses the code currently makes of
So I was able to use another channel for delivering errors in order. Back to using
This seems to work quite well; if everyone is happy with it then I can fix up the commits. Perhaps Tasks themselves and/or the
Heh, thinking too hard about cancellation has got me into all sorts of confusion (#33248).

For the current issue (and for some action we can do in the short term) I've been mulling over the

Overall I think it would be cleaner to go with

Having worked on the actual code, do you think it would be easy to set things up this way? It would mean that we don't use
Assuming Tasks are cheap to create, I think it makes perfect sense to have a one-to-one correspondence between each chunk of asynchronous work and a task. This gives the system scheduler much more information about the extent of the parallelism inherent in a calculation and simplifies

However, how do we limit the number of concurrent tasks with

I believe there is precedent for having scheduling contexts in operating systems and it could fit with cancellation tokens, but I think it is way outside the scope of this PR. Currently I think it makes sense to limit the number of tasks and have them take work off a queue. Also I think I can remove

In fact, instead of wrapping the user-supplied function in an outer function, it would be possible to create a wrapping task which runs an inner task synchronously. So the inner task just runs the user function as-is and can fail hard, producing a
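A rough sketch of that outer/inner task idea (hypothetical helper name, not what the PR currently does):

```julia
# The inner task runs the user function unmodified and is allowed to fail hard,
# keeping its own backtrace; the outer wrapper waits on it and decides how to
# surface the failure (here it simply lets the failure propagate via fetch).
function run_in_inner_task(f, x)
    inner = @task f(x)
    schedule(inner)
    fetch(inner)    # rethrows as TaskFailedException if `inner` failed (Julia ≥ 1.3)
end
```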
I think the choices are:

Pick two. Ignoring batching and composite exceptions, the implementation could almost be as simple as the following:

```julia
function asyncmap(f, itr, ntasks)
    concur = 0
    tsks = map(itr) do x
        t = @task begin
            y = f(x)
            concur -= 1
            y
        end
        while concur >= ntasks
            yield()
        end
        concur += 1
        schedule(t)
        t
    end
    fetch.(tsks)
end
```

Regardless of this, I have pushed an update to the current version so that it uses
Actually it is maybe possible to have something quite clean...

```julia
function asyncmap(f, c...; ntasks=0, batch_size=nothing)
    ntasks = max(ntasks, 100)
    do_asyncmap(f, c, ntasks)
end

function do_asyncmap(f, c, ntasks)
    echnl = Channel(ntasks)
    concur = 0
    tasks = Task[]
    for x in zip(c...)
        t = @task try
            f(x...)
        catch
            put!(echnl, current_task())
            rethrow()
        finally
            concur -= 1
        end
        while concur >= ntasks && !isready(echnl)
            yield()
        end
        isready(echnl) && break
        concur += 1
        schedule(t)
        push!(tasks, t)
    end
    _wait.(tasks)
    close(echnl)
    isready(echnl) &&
        throw(CompositeException(TaskFailedException.(echnl)))
    task_result.(tasks)
end
```

There doesn't seem to be a way to exit
I have completely rewritten it now (syntactically), although I have gone in a bit of a circle to come back round to something more similar to the original. It now has one task per work item, partial fail-fast semantics, errors are delivered in the order they happen, and it is maybe a little shorter. However, things would be a lot cleaner if

I still have to add some new tests for ensuring
I added more tests, updated the docs and cleaned up the commits. I think/hope it is ready for final review now.
Collect the TaskExceptions into a CompositeException.
Avoids more complicated scheduling within asyncmap by limiting the number of concurrent tasks with a simple counter. The counter may be removed in the future as the main scheduler advances.
I didn’t see this PR at the time, but #42105 does something similar, FYI.
Wrap the exception from the user code in CapturedException, which contains the original backtrace to the user's code.