my project is ~6x slower + validation errors #207
It looks like it's now really expensive to do things in the `map_read_async` callback. Did wgpu previously not block on it, but now does? If I keep the contents of the callback empty, the slowdown disappears. |
@rukai thank you for the issue! I don't think #197 changed anything w.r.t. read synchronization. Perhaps you've been using it wrong, and we didn't detect it properly before, and now we do? Are you making sure that the buffers you ask for mapping are actually done processing on the GPU?
That is weird. If it was the sync, you'd still see the slowdown even with an empty callback. What exactly is going on in the callback you have? |
uh...........

```rust
let width = 400;
let height = 300;
framebuffer_out.map_read_async(0, width as u64 * height as u64 * 4, move |result: wgpu::BufferMapAsyncResult<&[u32]>| {
    match result {
        Ok(data_u32) => {
            println!("foo: {}", frame_index);
            assert_eq!(width as usize * height as usize, data_u32.data.len());
            let mut data_u8: Vec<u8> = vec!(0; width as usize * height as usize * 4);
            for (i, value) in data_u32.data.iter().enumerate() {
                data_u8[i] = *value as u8;
                data_u8[i] = *value as u8;
                data_u8[i] = *value as u8;
                data_u8[i] = *value as u8;
                data_u8[i] = *value as u8;
                data_u8[i] = *value as u8;
                data_u8[i] = *value as u8;
                data_u8[i] = *value as u8;
                data_u8[i] = *value as u8;
                data_u8[i] = *value as u8;
            }
            frames_tx.send(data_u8).unwrap();
        }
        Err(error) => {
            panic!("map_read_async failed: {:?}", error); // We have to panic here to avoid an infinite loop :/
        }
    }
});
```

So this is inexplicably extremely expensive. Maybe rust is failing to optimize it now for some reason? (This is the real code https://github.com/rukai/brawllib_rs/blob/e6f659802a45bd5e4e8aee56827a98dbf202c395/src/renderer.rs#L185)

"Are you making sure that the buffers you ask for mapping are actually done processing on the GPU?" At this point I think I should try to come up with a proper runnable repro. I need to sleep now though. |
If you request a mapping, and the buffer is in use, then you can't use this buffer until the callback is issued. Your callback appears to do quite a bit of heavy lifting, and we can't do much while it's happening. Ideally, you'd not do this work in the callback, although this may require gfx-rs/wgpu-rs#9 API changes to become an option (cc @nical). Another thing that would help is ensuring the GPU is busy while your callback is happening on the CPU. |
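The advice above, keeping the callback cheap and moving the heavy lifting elsewhere, can be sketched with a plain channel-plus-worker setup. Everything here is illustrative and not code from the issue: the function name `expensive_convert`, the `raw_tx`/`frames_tx` channels, and the fake mapped data all stand in for the real wgpu callback.

```rust
use std::sync::mpsc;
use std::thread;

// Heavy per-pixel unpacking (u32 BGRA -> RGBA bytes), done OFF the callback.
// Stand-in for the conversion loop discussed in this issue.
fn expensive_convert(data_u32: &[u32]) -> Vec<u8> {
    let mut data_u8 = Vec::with_capacity(data_u32.len() * 4);
    for value in data_u32 {
        data_u8.push(((value & 0x00FF0000) >> 16) as u8);
        data_u8.push(((value & 0x0000FF00) >> 8) as u8);
        data_u8.push((value & 0x000000FF) as u8);
        data_u8.push(((value & 0xFF000000) >> 24) as u8);
    }
    data_u8
}

fn main() {
    let (raw_tx, raw_rx) = mpsc::channel::<Vec<u32>>();
    let (frames_tx, frames_rx) = mpsc::channel::<Vec<u8>>();

    // Worker thread does the conversion, so the mapping callback stays cheap.
    let worker = thread::spawn(move || {
        for raw in raw_rx {
            frames_tx.send(expensive_convert(&raw)).unwrap();
        }
    });

    // Inside the real callback you would only do this single bulk copy + send:
    let mapped: Vec<u32> = vec![0xFF112233; 4]; // pretend this is the mapped slice
    raw_tx.send(mapped).unwrap();
    drop(raw_tx); // closing the channel lets the worker exit

    let frame = frames_rx.recv().unwrap();
    assert_eq!(&frame[0..4], &[0x11, 0x22, 0x33, 0xFF]);
    worker.join().unwrap();
}
```

This keeps only a memcpy-sized amount of work inside the callback; the GPU can proceed while the worker thread unpacks.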
208: Ensure wait_for_fences is never called on 0 fences r=kvark a=rukai
I don't see any reason why `self.active` should never be empty. This fixes the validation errors but not the slowdown of #207. I presume the slowdown was causing `self.active` to clear.
Co-authored-by: Rukai <[email protected]>
Either wgpu is giving me cursed memory that is insanely slow to read from, or rust is now creating insanely slow code due to a change in wgpu. I am aware how little sense either of those options make. But consider the below code:

```rust
assert_eq!(width as usize * height as usize, data_u32.data.len());
let instant = std::time::Instant::now();
let mut data_u8: Vec<u8> = vec!(0; width as usize * height as usize * 4);
for (i, value) in data_u32.data.iter().enumerate() {
    data_u8[i * 4 + 0] = ((*value & 0x00FF0000) >> 16) as u8;
    data_u8[i * 4 + 1] = ((*value & 0x0000FF00) >> 08) as u8;
    data_u8[i * 4 + 2] = ((*value & 0x000000FF) >> 00) as u8;
    data_u8[i * 4 + 3] = ((*value & 0xFF000000) >> 24) as u8;
}
println!("{:?}", instant.elapsed());
```

When it is in the wgpu callback it takes 32ms to complete. Adding more `data_u8[i * 4 + 2] = ....` lines increases the time taken linearly. If I take the same for loop out into a separate project it reports ~350µs. |
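The `to_vec` workaround that comes up later in this thread amounts to reading the (possibly uncached) mapped slice exactly once with a bulk copy, then doing the per-byte unpacking against ordinary heap memory. A minimal sketch, with a made-up function name and test values:

```rust
// Sketch: bulk-copy the mapped slice first, then unpack from local memory.
// `mapped` stands in for the &[u32] slice handed to the mapping callback.
fn unpack_frame(mapped: &[u32]) -> Vec<u8> {
    // One sequential pass over the slow mapped memory.
    let local: Vec<u32> = mapped.to_vec();

    // All the random-access work now hits ordinary cached heap memory.
    let mut data_u8 = vec![0u8; local.len() * 4];
    for (i, value) in local.iter().enumerate() {
        data_u8[i * 4] = ((value & 0x00FF0000) >> 16) as u8;
        data_u8[i * 4 + 1] = ((value & 0x0000FF00) >> 8) as u8;
        data_u8[i * 4 + 2] = (value & 0x000000FF) as u8;
        data_u8[i * 4 + 3] = ((value & 0xFF000000) >> 24) as u8;
    }
    data_u8
}

fn main() {
    let frame = unpack_frame(&[0xAABBCCDD, 0x00000000]);
    assert_eq!(frame, vec![0xBB, 0xCC, 0xDD, 0xAA, 0, 0, 0, 0]);
}
```

The point is that every `data_u8[...] = ...` line in the original reads `*value` from mapped memory again, so if that memory is uncached, each extra line pays the full cost, which matches the linear scaling observed above.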
Oh, interesting! Maybe it gives you coherent memory that is slow, while previously we used an incoherent one?
Cc @omni-viral
|
Uncached memory is slow to read. What memory usage is specified for the allocation? BTW. If for |
If the buffer was created for both read and write, then here:

```rust
if !desc.usage.intersects(Bu::MAP_READ | Bu::MAP_WRITE) {
    Muv::Data
} else if (Bu::MAP_WRITE | Bu::TRANSFER_SRC).contains(desc.usage) {
    Muv::Upload
} else if (Bu::MAP_READ | Bu::TRANSFER_DST).contains(desc.usage) {
    Muv::Download
} else {
    Muv::Dynamic
}
```

It will pick `Muv::Dynamic`. |
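The classification above can be exercised with a simplified stand-in for the bitflags types. The flag values, constant names, and `classify` function below are illustrative, not wgpu's actual definitions; `contains` mimics the bitflags semantics of "every set bit of the argument is also set in the receiver".

```rust
// Simplified stand-in for wgpu's BufferUsage bitflags and its
// memory-usage classification (Muv). Values here are illustrative.
#[derive(Debug, PartialEq)]
enum Muv { Data, Upload, Download, Dynamic }

const MAP_READ: u32 = 1;
const MAP_WRITE: u32 = 2;
const TRANSFER_SRC: u32 = 4;
const TRANSFER_DST: u32 = 8;

// bitflags-style contains: is `sub` a subset of `superset`?
fn contains(superset: u32, sub: u32) -> bool {
    superset & sub == sub
}

fn classify(usage: u32) -> Muv {
    if usage & (MAP_READ | MAP_WRITE) == 0 {
        Muv::Data
    } else if contains(MAP_WRITE | TRANSFER_SRC, usage) {
        Muv::Upload
    } else if contains(MAP_READ | TRANSFER_DST, usage) {
        Muv::Download
    } else {
        Muv::Dynamic
    }
}

fn main() {
    let all = MAP_READ | MAP_WRITE | TRANSFER_SRC | TRANSFER_DST;
    // A buffer created with every usage bit fails both subset checks...
    assert_eq!(classify(all), Muv::Dynamic);
    // ...while MAP_READ | TRANSFER_DST classifies as Download.
    assert_eq!(classify(MAP_READ | TRANSFER_DST), Muv::Download);
    // No mapping bits at all means device-local Data memory.
    assert_eq!(classify(TRANSFER_SRC | TRANSFER_DST), Muv::Data);
}
```

This is why the usage flags matter for performance here: `Dynamic` can land in slower (e.g. uncached, host-visible) memory, whereas `Download` gets memory suited for CPU reads.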
I just figured out I can reduce the time from 32ms to 1.3ms by calling `to_vec` on the mapped data first. But I'm not reading much data, just a 400x300 image of u32 colors. So it might still be too slow in general? |
Oh, I get what omni-viral is saying now. I tried what he suggested, but it had no effect. |
Without `to_vec`? |
Yes, without `to_vec` it had no effect. |
The problem is that you specify `BufferUsage::all()` for this one, getting the `Dynamic` type. Instead, try only having `MAP_READ | TRANSFER_DST`, which would classify it as the `Download` type.
|
I suggest implementing |
Aha, changing my framebuffer to use `MAP_READ | TRANSFER_DST` fixed the slowdown. The examples don't even use it. I think this issue should be good to close, unless it should be kept open to track the other issues that have been uncovered? |
Created #210 to follow up |
208: Remove lifetime from SwapChainOutput r=kvark a=lachlansneff This PR removes the lifetime from `SwapChainOutput`. It's not clear to me if the lifetime is necessary for soundness reasons, so let me know if so. Resolves gfx-rs#207 Co-authored-by: Lachlan Sneff <[email protected]>
It used to take 1m35s to generate a specific set of gifs.
Now the same code takes 9m45s to generate the same set of gifs.
This seems to have been introduced partway through #197, as that is when I measured the 1m35s; prior to that it seemed to progress at a similar rate until it panicked.
It seems like people have been able to resolve my issues before I provide a repro. So if the problem is obvious then great; otherwise let me know and I'll try to figure out what exactly is slower and get a repro together.
A gif of 120 frames takes ~4s to generate.
A gif of 71 frames takes ~2.37s to generate.
It's almost like it's capped at 30fps??? Which doesn't make any sense, as there is no swapchain involved.
Edit: Oh durr, remembered to check the validation layers and there are some errors.
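The ~30fps suspicion follows directly from the frame counts and timings quoted above; a quick arithmetic check:

```rust
fn main() {
    // Numbers from the issue: a 120-frame gif takes ~4s, a 71-frame gif ~2.37s.
    let fps_a = 120.0 / 4.0;
    let fps_b = 71.0 / 2.37;

    // Both come out within half a frame of 30fps.
    assert!((fps_a - 30.0_f64).abs() < 0.5);
    assert!((fps_b - 30.0_f64).abs() < 0.5);
    println!("{:.1} fps, {:.1} fps", fps_a, fps_b);
}
```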