Standalone pass recording #438

Closed
kvark opened this issue Jan 7, 2020 · 11 comments · Fixed by #440
Assignees: kvark
Labels: area: performance (How fast things go), type: enhancement (New feature or request)

Comments


kvark commented Jan 7, 2020

Currently, each and every operation on a command encoder or a pass goes directly into the driver. This was done with the intention of having minimal overhead and maximum simplicity. However, I don't think the effort is paying off: each pass operation still has to switch on the backend type and (more importantly) lock at least one storage (e.g. for the pass).

This approach is not feasible for powering the WebGPU implementation in Gecko, where wgpu lives in a separate process, so direct low-overhead access on each command is not an option.

From the early days, the plan was to have "software" passes recorded on the client side and then actually provided to wgpu on a command-by-command basis on the server (after crossing the IPC). The idea was that we'd figure out all the usage across the pass, so that we can optionally pass that when we start the pass, telling wgpu to not bother inserting transitions and instead just validate that the given usages are correct.
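
A rough sketch of that plan (hypothetical names, not the actual wgpu-core API): the client records the pass, computes the combined usage of every resource it touches, and hands that summary over when the pass starts, so the server only validates instead of deriving transitions.

```rust
// Hypothetical sketch of the original plan; none of these types exist in
// wgpu-core under these names. The client attaches a pre-computed usage
// summary to the pass it starts.
struct DeclaredUsage {
    buffer_id: u64,
    usage_bits: u32, // e.g. VERTEX | INDEX | UNIFORM, combined over the pass
}

struct PassStartInfo {
    // When `Some`, wgpu would skip building transitions on the fly and only
    // validate that the recorded commands stay within the declared usage.
    declared_usage: Option<Vec<DeclaredUsage>>,
}
```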

As it turns out, this approach has several problems:

  1. recording command-by-command still suffers from the overhead
  2. having different code paths in wgpu-core for pass recording is a major complication

Fortunately, I believe these are solvable. My suggestion is to move to software passes consistently everywhere, not just for Gecko implementation. This would give us the following advantages:

  • simpler and more unified code for pass recording
  • strictly one native command buffer per wgpu command buffer (unlike today, where each pass ends up being a native command buffer, and we insert more for stitching...)
  • solves #89 (Single-recording mode for passes)
  • makes #72 (Capturing internal state for debug investigations) easier
  • potentially faster recording of commands
  • potential for recording passes on multiple threads
  • potential for re-using passes (as an alternative to bundles and/or re-usable command buffers)
kvark added the labels type: enhancement (New feature or request) and area: performance (How fast things go) on Jan 7, 2020
kvark self-assigned this on Jan 7, 2020

kvark commented Jan 7, 2020

Well, late-night programming backfires. And the previous plan has been crushed by reality...
We can't fully build transitions and states for a pass based purely on IDs, since IDs link to each other: whenever we see a TextureViewId or BindGroupId, we have to look it up in order to find the texture sub-resources whose usage needs to change.

If we do the look-up on every operation, we are still going to pay some (though less) overhead, even with software pass recording. Alternatively, the recorded passes would carry no tracking information, and that would need to be rebuilt by the implementation.
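
A minimal sketch of the problem (hypothetical types, not the wgpu-core ones): the recorded commands carry only IDs, so the sub-resources behind a bind group can only be discovered by a storage lookup, either at record time or at replay time.

```rust
// Hypothetical ID and storage types standing in for wgpu-core's hub.
type BindGroupId = usize;

struct BindGroup {
    // The texture views referenced by the group; invisible from the ID alone.
    texture_view_ids: Vec<usize>,
}

enum RenderCommand {
    SetBindGroup { index: u32, bind_group_id: BindGroupId },
    Draw { vertex_count: u32, instance_count: u32 },
}

// Replaying a recorded pass: every SetBindGroup still costs a lookup,
// because the usage of the underlying texture sub-resources has to be
// derived from the bind group contents, not from the ID.
fn replay(commands: &[RenderCommand], bind_groups: &[BindGroup]) {
    for command in commands {
        if let RenderCommand::SetBindGroup { bind_group_id, .. } = command {
            let group = &bind_groups[*bind_group_id];
            for view_id in &group.texture_view_ids {
                // update the usage tracker for this texture view here
                let _ = view_id;
            }
        }
    }
}
```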

kvark changed the title from "Soft pass recording" to "Standalone pass recording" on Jan 8, 2020

kvark commented Jan 8, 2020

Progress with #440 produced one more insight: there is no point in making the pass structures open:

  • wgpu-native would still have to expose the same API in order to be compatible with other implementations, so it needs a struct owning the command list, with methods on it for adding new commands
  • wgpu-remote would still have to expose some C API to Gecko, which makes our enum XxxCommand inaccessible directly anyway. It might as well be something very close to wgpu-native, if not exactly the same (code sharing may be beneficial, but tricky, since wgpu-core doesn't currently have any C API)
  • wgpu-rs has to use the same standard C API that wgpu-native exposes in order to target the Web.
  • the only client that could theoretically take advantage of this openness is Servo

So here is what this is shaping up to be: we'll have opaque pass builders exposed by wgpu-core. They will be wrapped in C API functions that could live in a common place, say a wgpu-core module that is pub-imported in both wgpu-native and wgpu-remote. The opaque passes would hold a Vec of commands (plus a Vec for dynamic offsets). We could probably make CommandBuffer own these vectors and recycle them.
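
A minimal sketch of that shape (hypothetical names, not the final wgpu-core API): the opaque builder owns the command list and the dynamic-offset buffer, and hands both back for recycling when the pass ends.

```rust
// Hypothetical opaque pass builder; the C API wrappers would call these
// methods, and neither the enum nor the vectors are exposed to clients.
enum PassCommand {
    SetBindGroup { index: u32, bind_group_id: u64, offset_count: usize },
    Draw { vertex_count: u32, instance_count: u32 },
}

struct RawRenderPass {
    commands: Vec<PassCommand>,
    dynamic_offsets: Vec<u32>,
}

impl RawRenderPass {
    fn set_bind_group(&mut self, index: u32, bind_group_id: u64, offsets: &[u32]) {
        self.dynamic_offsets.extend_from_slice(offsets);
        self.commands.push(PassCommand::SetBindGroup {
            index,
            bind_group_id,
            offset_count: offsets.len(),
        });
    }

    // Ending the pass returns the vectors so the owning CommandBuffer can
    // keep them around and reuse the allocations for the next pass.
    fn finish(self) -> (Vec<PassCommand>, Vec<u32>) {
        (self.commands, self.dynamic_offsets)
    }
}
```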

bors bot added a commit that referenced this issue Jan 9, 2020
439: Refactor usage tracking to be truly sparse r=dependency a=kvark

~~This is a required step towards #438 . We want to be tracking the usage by only having resource IDs around, not resources themselves. So we aren't going to have access to `full_selector`.~~

It's also streamlining some of the verbose parts of the internal API for usage tracking.
It also uses `SmallVec` for the texture tracker ranges, essentially making it lightweight for most cases (where the layer count is 1).

Compromises:
  - removes `SEPARATE_DEPTH_STENCIL_STATES` support. It's not going to be enabled in the near future anyway.

Co-authored-by: Dzmitry Malyshau <[email protected]>
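
To illustrate the SmallVec point from the PR description above (hypothetical types, not the actual wgpu-core tracker, and assuming the smallvec crate): storing the per-mip layer ranges inline makes the common single-layer case allocation-free.

```rust
use smallvec::SmallVec;

#[derive(Copy, Clone, PartialEq, Debug)]
enum TextureUse {
    Sampled,
    OutputAttachment,
}

// Hypothetical per-mip usage tracker: a sorted list of (end_layer, usage)
// breakpoints. With one inline slot, layer_count == 1 never touches the heap.
struct MipTracker {
    ranges: SmallVec<[(u32, TextureUse); 1]>,
}

impl MipTracker {
    fn new_uniform(layer_count: u32, usage: TextureUse) -> Self {
        let mut ranges = SmallVec::new();
        ranges.push((layer_count, usage));
        MipTracker { ranges }
    }
}
```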

kvark commented Jan 12, 2020

Just as #440 was getting ready to be merged, I found an issue with the original design. If the client-side recording doesn't do anything with the IDs and just passes them through, it can't guarantee that all of the related objects are still alive by the time the recording is finished.

This is actually a bigger issue than just the code; I see it as a problem with the spec. Suppose the user does setBindGroup with a bind group ID, and then instantly drops the bind group (e.g. by calling delete(), decreasing the ref count, or something else). If the server side receives these events in order, it's able to correctly retain the bind group for the lifetime of the command buffer. However, if the GPU process has a separate thread that does command buffer recording, the messages to this thread would race with the message to delete the bind group (sent to the thread responsible for the device operations). Which means the bind group may end up being removed while it's still used in the recording...

@Kangz this leads me to believe the current API is tailored towards Wire-style processing, where there is exactly one communication channel between the client and the server, i.e. everything is serialized. That seems like a major constraint on parallelizing both sides in the future (a Web worker recording a command buffer -> a thread on the GPU process side). Am I overthinking this?


kvark commented Jan 12, 2020

@grovesNL here is what I'm thinking so far. We should be able to proceed with the code in #440 if the users of wgpu-core enforce the guarantee [1]:

  • wgpu-rs can expect the resource lifetime to include the pass lifetime (hopefully)
  • Gecko WebGPU bindings can hold references to objects inside the pass

This means that we are still going to be super fast when going through wgpu-rs, and for JS clients the burden of lifetimes is resolved on the client side by Gecko. Icing on the cake: the work done in #440 doesn't need to be thrown out or reconsidered :)

Also, cc @jdashg in case you have a strong opinion about ^ and time to give feedback.

[1] Any resource used by a pass recording stays alive at least until the end of the pass.

grovesNL (Collaborator) commented:

> JS clients the burden of lifetimes is resolved on the client side by Gecko

Would we have to resolve this issue in wgpu-native to implement webgpu-headers too?


kvark commented Jan 12, 2020

@grovesNL Good point! The answer depends on how we specify webgpu-headers guarantees...

The discussion needs to take place with all the stakeholders. So far, it doesn't appear immediately useful to me to have the ability to drop something that you are using in a pass. Most often, today at least, you'd create temporary buffers to upload data, which means you'd be doing transfer operations. Those aren't expected to happen inside a pass.


kvark commented Jan 13, 2020

Thought about this some more. Having the references held on the JS side seems most reasonable to me, since it's supposedly easier than reaching out to wgpu's refcount (for any resources we touch), which is stored on the other side of the IPC.

The ability (possible, but not yet tested) to guarantee this at the type level in wgpu-rs is a huge plus. It means that, if we go with a wgpu-core backend as outlined in gfx-rs/wgpu-rs#156, we aren't going to pay any overhead for this during pass recording when targeting native.

The problem with webgpu-headers/wgpu-native is still solvable - we'll just need shallow tracking of used resources in the passes, specifically when not running through wgpu-rs. Alternatively, we can adjust the native spec if Google agrees, but it's not restricting us anyway.
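
A minimal sketch of the type-level guarantee mentioned above (hypothetical types, not the exact wgpu-rs API): the pass borrows every resource it records, so the borrow checker forces callers to keep those resources alive until the pass is dropped.

```rust
struct BindGroup;

// The lifetime 'a ties every recorded resource to the pass itself.
struct RenderPass<'a> {
    used_bind_groups: Vec<&'a BindGroup>,
}

impl<'a> RenderPass<'a> {
    fn set_bind_group(&mut self, _index: u32, bind_group: &'a BindGroup) {
        // Only a borrow is stored: dropping `bind_group` while the pass is
        // still alive becomes a compile-time error for the caller.
        self.used_bind_groups.push(bind_group);
    }
}
```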


kvark commented Jan 13, 2020

update: this is quite easy to bake into wgpu-rs 🎉 - gfx-rs/wgpu-rs#155 is now updated
update-2: I ported vange-rs to this version, and it politely pointed me to all the places where I was creating temporary buffers in a pass. After I refactored these, it all worked out nicely, so now I'm confident that the implementation of #440 is good, quality-wise.


kvark commented Jan 13, 2020

Changes to wgpu-native would be larger than I expected. In order to access the refcounts (even shallow ones), we need the Global context, which is not implicitly available at the level of wgpu-core, so wgpu-core can no longer expose the C API entry points to be shared between wgpu-native and wgpu-remote.

Possible ways to proceed:

  1. fix it at the native spec level (filed webgpu-native/webgpu-headers#29, "Holding pass dependencies alive")
  2. duplicate the C API entry points between wgpu-native and wgpu-remote, so that we have access to the context, and we can bump the refcounts temporarily. The bumps would be removed once the recording happens at the end of a pass.
  3. refactor the refcounts in a way that allows us to access them without going through the context. This should be technically possible, since the refcounts don't generically depend on the backend.
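
A rough sketch of option 3 (purely illustrative, not the existing wgpu-core refcount machinery): if every resource carries a backend-independent refcount handle, the pass recorder can clone it without going through the Global context, and drop the temporary refs once the pass is encoded.

```rust
use std::sync::Arc;

// A backend-independent handle: cloning it keeps the resource alive.
#[derive(Clone)]
struct RefCount(Arc<()>);

struct PassRecorder {
    // Temporary refs taken for every resource touched during recording;
    // dropping the recorder (at the end of the pass) releases them.
    temporary_refs: Vec<RefCount>,
}

impl PassRecorder {
    fn use_resource(&mut self, ref_count: &RefCount) {
        self.temporary_refs.push(ref_count.clone());
    }
}
```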


kdashg commented Jan 13, 2020

The recent WebGL client/host split for OOP does GC/CC reference tracking precisely on the client side, fwiw. I'm in favor of that solution.
I think naively adding resource strongrefs to passes/cbufs (even if they end up being errors, or never executed) is a great approach.


Kangz commented Jan 14, 2020

In Chromium / Dawn we're looking at serializing each command separately with the (id, generation) of objects. Imagine we are doing pass.SetIndexBuffer(buf, 0): it doesn't matter if buf is garbage collected in another worker, because a reference to it still exists in the current worker, and commands are well ordered within a single worker (meaning SetIndexBuffer will be called before the current reference to buf is dropped, which itself happens before the wire call to DestroyBuffer(buf)).

If you want to instead have a meta IPC command like FillRenderPass(pass, [commands]), making the JS object keep a reference to each of the objects passed in commands would work to order deletion after the wire call.
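
A small sketch of the wire scheme described above (hypothetical names, in Rust rather than Dawn's C++): each serialized command carries the (id, generation) pair of the objects it references, and in-order delivery on a single channel guarantees SetIndexBuffer is processed before the later DestroyBuffer.

```rust
#[derive(Copy, Clone, PartialEq, Eq)]
struct ObjectHandle {
    id: u32,
    generation: u32,
}

// Commands as they would travel over the wire, in submission order.
enum WireCommand {
    SetIndexBuffer { pass: ObjectHandle, buffer: ObjectHandle, offset: u64 },
    DestroyBuffer { buffer: ObjectHandle },
}
```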

bors bot closed this as completed in a72574a on Jan 14, 2020