my project is ~6x slower + validation errors #207

Closed
rukai opened this issue Jun 3, 2019 · 16 comments
Labels
type: question Further information is requested

Comments

@rukai
Contributor

rukai commented Jun 3, 2019

It used to take 1m35s to generate a specific set of gifs.
Now the same code is taking 9m45s to generate the same set of gifs.

This seems to have been introduced partway through #197, as that is when I measured the 1m35s; prior to that it seemed to progress at a similar rate until it panicked.

It seems like people have been able to resolve my issues before I provided a repro, so if the problem is obvious then great; otherwise let me know and I'll try to figure out what exactly is slower and put a repro together.

A gif of 120 frames takes ~4s to generate.
A gif of 71 frames takes ~2.37s to generate.

It's almost like it's capped at 30 fps? That doesn't make any sense, as there is no swapchain involved.

Edit: Oh durr, I remembered to check the validation layers and there are some errors.

VALIDATION [VUID_Undefined (0)] : vkWaitForFences: parameter fenceCount must be greater than 0.
object info: (type: UNKNOWN, hndl: 0)
@rukai rukai changed the title from "my project is ~6x slower" to "my project is ~6x slower + validation errors" on Jun 3, 2019
@rukai
Contributor Author

rukai commented Jun 3, 2019

It looks like it's now really expensive to do things in the map_read_async callback. Did wgpu previously not block on it, but now does?

If I keep the contents of the callback empty then the slowdown disappears.

@kvark
Member

kvark commented Jun 3, 2019

@rukai thank you for the issue! I don't think #197 changed anything w.r.t. read synchronization. Perhaps you've been using it wrong and we didn't detect it properly before, but now we do?

Are you making sure that the buffers you ask for mapping are actually done processing on the GPU?

If I keep the contents of the callback empty then the slowdown disappears.

That is weird. If it was the sync, you'd still see the slowdown even with an empty callback. What exactly is going on in the callback you have?

@rukai
Contributor Author

rukai commented Jun 3, 2019

uh...........

        let width = 400;
        let height = 300;
        framebuffer_out.map_read_async(0, width as u64 * height as u64 * 4, move |result: wgpu::BufferMapAsyncResult<&[u32]>| {
            match result {
                Ok(data_u32) => {
                    println!("foo: {}", frame_index);
                    assert_eq!(width as usize * height as usize, data_u32.data.len());
                    let mut data_u8: Vec<u8> = vec!(0; width as usize * height as usize * 4);
                    for (i, value) in data_u32.data.iter().enumerate() {
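                        // (the same assignment is repeated on purpose: each extra copy of this
                        // line adds the same amount of time, as noted below the snippet)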
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                    }
                    frames_tx.send(data_u8).unwrap();
                }
                Err(error) => {
                    panic!("map_read_async failed: {:?}", error); // We have to panic here to avoid an infinite loop :/
                }
            }
        });

So this is, inexplicably, extremely expensive. Maybe Rust is failing to optimize it now for some reason?
The more data_u8[i] = *value as u8 lines I add, the slower it gets.

(This is the real code https://github.com/rukai/brawllib_rs/blob/e6f659802a45bd5e4e8aee56827a98dbf202c395/src/renderer.rs#L185)

"Are you making sure that the buffers you ask for mapping are actually done processing on the GPU?"
No?
I just set up the map_read_async assuming it will be run as soon as it's ready to be run.
Is there something else I should be doing?

At this point I think I should try to come up with a proper runnable repro.

I need to sleep now, though.

@kvark
Member

kvark commented Jun 3, 2019

If you request a mapping and the buffer is in use, then you can't use this buffer until the callback is issued. Your callback appears to do quite a bit of heavy lifting, and we can't do much while it's happening. Ideally you wouldn't do this work in the callback, although that may require the gfx-rs/wgpu-rs#9 API changes to become an option (cc @nical).

Another thing that would help is ensuring the GPU is busy while your callback is running on the CPU.
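
A minimal sketch of what moving the work out of the callback could look like, reusing the map_read_async call from the snippet above (illustrative only; frames_tx is assumed to be an std::sync::mpsc::Sender<Vec<u32>>, mirroring the earlier code):

framebuffer_out.map_read_async(0, width as u64 * height as u64 * 4, move |result: wgpu::BufferMapAsyncResult<&[u32]>| {
    if let Ok(mapped) = result {
        // Do only one bulk copy while the buffer is mapped, then return immediately.
        frames_tx.send(mapped.data.to_vec()).unwrap();
    }
    // The receiving end of the channel performs the u32 -> RGBA8 conversion later,
    // outside the callback, so wgpu is not stalled while it runs.
});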

@kvark kvark added the type: question Further information is requested label Jun 3, 2019
bors bot added a commit that referenced this issue Jun 3, 2019
208: Ensure wait_for_fences is never called on 0 fences r=kvark a=rukai

I don't see any reason why self.active should never be empty.

This fixes the validation errors but not the slowdown of #207
I presume the slowdown was causing self.active to clear.

Co-authored-by: Rukai <[email protected]>
@rukai
Contributor Author

rukai commented Jun 4, 2019

Either wgpu is giving me cursed memory that is insanely slow to read from, or Rust is now generating insanely slow code due to a change in wgpu.

I am aware of how little sense either of those options makes, but consider the code below:

                    assert_eq!(width as usize * height as usize, data_u32.data.len());
                    let instant = std::time::Instant::now();
                    let mut data_u8: Vec<u8> = vec!(0; width as usize * height as usize * 4);
                    for (i, value) in data_u32.data.iter().enumerate() {
                        data_u8[i * 4 + 0] = ((*value & 0x00FF0000) >> 16) as u8;
                        data_u8[i * 4 + 1] = ((*value & 0x0000FF00) >> 08) as u8;
                        data_u8[i * 4 + 2] = ((*value & 0x000000FF) >> 00) as u8;
                        data_u8[i * 4 + 3] = ((*value & 0xFF000000) >> 24) as u8;
                    }
                    println!("{:?}", instant.elapsed());

When it is in the wgpu callback, it takes 32ms to complete. Adding more data_u8[i * 4 + 2] = .... lines increases the time taken linearly.

If I move the same for loop into a separate project, it reports ~350µs.

@kvark
Member

kvark commented Jun 4, 2019 via email

@zakarumych

zakarumych commented Jun 4, 2019

Uncached memory is slow to read. What memory usage is specified for the allocation? Download memory should be cached and suitable for reads, while Upload usage avoids cached memory types.

BTW, if the built-in set of memory usage types is too limiting for wgpu's cases, the allocation function takes impl MemoryUsage, so one can make a custom memory usage type.

@zakarumych

zakarumych commented Jun 4, 2019

If the buffer was created for both read and write, then here:

if !desc.usage.intersects(Bu::MAP_READ | Bu::MAP_WRITE) {
  Muv::Data
} else if (Bu::MAP_WRITE | Bu::TRANSFER_SRC).contains(desc.usage) {
  Muv::Upload
} else if (Bu::MAP_READ | Bu::TRANSFER_DST).contains(desc.usage) {
  Muv::Download
} else {
  Muv::Dynamic
}

It will pick Upload over Download. It should be vice versa, I think.
And I should either add this to the documentation, or add a memory usage type for read-and-write.

@rukai
Contributor Author

rukai commented Jun 4, 2019

I just figured out I can reduce the time from 32ms to 1.3ms by calling .to_vec() on the slice I am given first, and then reading from the vec instead of the slice. This is suitable for my needs, I'm now bottlenecked elsewhere.

But I'm not reading much data, just a 400x300 image of u32 colors. So it might still be too slow in general?
Using an image size of 1024x768, the to_vec() method takes ~10ms.
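
For reference, a sketch of that workaround applied to the conversion loop from the earlier snippet (same variable names as before; the only change is that the mapped slice is copied into an ordinary Vec before the per-pixel loop runs):

// Copy out of the mapped memory once, then read from plain heap memory.
let copied: Vec<u32> = data_u32.data.to_vec();
let mut data_u8: Vec<u8> = Vec::with_capacity(copied.len() * 4);
for value in &copied {
    data_u8.push(((value & 0x00FF_0000) >> 16) as u8);
    data_u8.push(((value & 0x0000_FF00) >> 8) as u8);
    data_u8.push((value & 0x0000_00FF) as u8);
    data_u8.push(((value & 0xFF00_0000) >> 24) as u8);
}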

@rukai
Contributor Author

rukai commented Jun 4, 2019

Oh I get what omni-viral is saying now.
I tried swapping the order of

} else if (Bu::MAP_WRITE | Bu::TRANSFER_SRC).contains(desc.usage) {
  Muv::Upload

and

} else if (Bu::MAP_READ | Bu::TRANSFER_DST).contains(desc.usage) {
  Muv::Download

But it had no effect.

@zakarumych

zakarumych commented Jun 4, 2019

But it had no effect.

Without the to_vec thing?

@rukai
Contributor Author

rukai commented Jun 4, 2019

Yes, without to_vec it had no effect.

@kvark
Member

kvark commented Jun 4, 2019 via email

@zakarumych

zakarumych commented Jun 4, 2019

Dynamic is not really a great choice for BufferUsage::all(), as it also avoids cached memory, so reads are awful.

I suggest implementing rendy::memory::MemoryUsage for BufferUsage to take all of the set flags into account appropriately.

@rukai
Contributor Author

rukai commented Jun 4, 2019

Aha, changing my framebuffer to use wgpu::BufferUsage::MAP_READ | wgpu::BufferUsage::TRANSFER_DST and using the rendy master branch, it now completes in ~250µs. Much better!
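
For anyone hitting the same thing, a minimal sketch of the buffer creation after the change (field names assume the wgpu-rs API in use at the time; the buffer previously used BufferUsage::all()):

// Declare the readback buffer with only the usages it needs, so the allocator
// can place it in cached, download-friendly memory (MAP_READ | TRANSFER_DST
// matches rendy's Download usage in the snippet quoted above).
let framebuffer_out = device.create_buffer(&wgpu::BufferDescriptor {
    size: width as u64 * height as u64 * 4,
    usage: wgpu::BufferUsage::MAP_READ | wgpu::BufferUsage::TRANSFER_DST,
});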

The examples don't even use BufferUsage::all(), so there is nothing to change there; I must have just added it myself while initially trying to get things working.
Hopefully in the future people will be able to realize the cause themselves, because they will have just written the needless BufferUsage flags. For me it was harder to reason about, because it was only fast before due to a bug in wgpu, and then suddenly not fast.

I think this issue should be good to close, unless it should be kept open to track other issues that have been uncovered?

@kvark
Member

kvark commented Jun 4, 2019

Created #210 to follow up

@kvark kvark closed this as completed Jun 4, 2019