my project is ~6x slower + validation errors #207

Closed
rukai opened this issue Jun 3, 2019 · 16 comments
Labels
type: question Further information is requested

Comments

@rukai
Contributor

rukai commented Jun 3, 2019

It used to take 1m35s to generate a specific set of gifs.
Now the same code is taking 9m45s to generate the same set of gifs.

This seems to have been introduced partway through #197, as that is when I measured the 1m35s; prior to that it seemed to progress at a similar rate until it panicked.

It seems like people have been able to resolve my issues before I provided a repro, so if the problem is obvious then great; otherwise let me know and I'll try to figure out what exactly is slower and put a repro together.

A gif of 120 frames takes ~4s to generate.
A gif of 71 frames takes ~2.37s to generate.

It's almost like it's capped at 30 fps? That doesn't make any sense, as there is no swapchain involved.

Edit: Oh durr, I remembered to check the validation layers and there are some errors.

VALIDATION [VUID_Undefined (0)] : vkWaitForFences: parameter fenceCount must be greater than 0.
object info: (type: UNKNOWN, hndl: 0)
@rukai rukai changed the title from "my project is ~6x slower" to "my project is ~6x slower + validation errors" on Jun 3, 2019
@rukai
Contributor Author

rukai commented Jun 3, 2019

It looks like it's now really expensive to do things in the map_read_async callback. Did wgpu previously not block on it, but now does?

If I keep the contents of the callback empty then the slowdown disappears.

@kvark
Member

kvark commented Jun 3, 2019

@rukai thank you for the issue! I don't think #197 changed anything w.r.t. read synchronization. Perhaps you've been using it wrong and we didn't detect it properly before, but now we do?

Are you making sure that the buffers you ask for mapping are actually done processing on the GPU?

If I keep the contents of the callback empty then the slowdown disappears.

That is weird. If it was the sync, you'd still see the slowdown even with an empty callback. What exactly is going on in the callback you have?

@rukai
Contributor Author

rukai commented Jun 3, 2019

uh...........

        let width = 400;
        let height = 300;
        framebuffer_out.map_read_async(0, width as u64 * height as u64 * 4, move |result: wgpu::BufferMapAsyncResult<&[u32]>| {
            match result {
                Ok(data_u32) => {
                    println!("foo: {}", frame_index);
                    assert_eq!(width as usize * height as usize, data_u32.data.len());
                    let mut data_u8: Vec<u8> = vec!(0; width as usize * height as usize * 4);
                    for (i, value) in data_u32.data.iter().enumerate() {
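                        // (the same assignment is repeated on purpose: each extra copy of this
                        // line adds the same amount of time, as noted below the snippet)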
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                        data_u8[i] = *value as u8;
                    }
                    frames_tx.send(data_u8).unwrap();
                }
                Err(error) => {
                    panic!("map_read_async failed: {:?}", error); // We have to panic here to avoid an infinite loop :/
                }
            }
        });

So this is, inexplicably, extremely expensive. Maybe Rust is failing to optimize it now for some reason?
The more data_u8[i] = *value as u8 lines I add, the slower it gets.

(This is the real code https://github.com/rukai/brawllib_rs/blob/e6f659802a45bd5e4e8aee56827a98dbf202c395/src/renderer.rs#L185)

"Are you making sure that the buffers you ask for mapping are actually done processing on the GPU?"
No?
I just set up the map_read_async assuming it will be run as soon as it's ready to be run.
Is there something else I should be doing?

At this point I think I should try to come up with a proper runnable repro.

I need to sleep now, though.

@kvark
Member

kvark commented Jun 3, 2019

If you request a mapping and the buffer is in use, then you can't use this buffer until the callback is issued. Your callback appears to do quite a bit of heavy lifting, and we can't do much while it's happening. Ideally you wouldn't do this work in the callback, although that may require the gfx-rs/wgpu-rs#9 API changes to become an option (cc @nical).

Another thing that would help is ensuring the GPU is busy while your callback is running on the CPU.
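
A minimal sketch of what moving the work out of the callback could look like, reusing the map_read_async call from the snippet above (illustrative only; frames_tx is assumed to be an std::sync::mpsc::Sender<Vec<u32>>, mirroring the earlier code):

framebuffer_out.map_read_async(0, width as u64 * height as u64 * 4, move |result: wgpu::BufferMapAsyncResult<&[u32]>| {
    if let Ok(mapped) = result {
        // Do only one bulk copy while the buffer is mapped, then return immediately.
        frames_tx.send(mapped.data.to_vec()).unwrap();
    }
    // The receiving end of the channel performs the u32 -> RGBA8 conversion later,
    // outside the callback, so wgpu is not stalled while it runs.
});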

@kvark kvark added the type: question Further information is requested label Jun 3, 2019
bors bot added a commit that referenced this issue Jun 3, 2019
208: Ensure wait_for_fences is never called on 0 fences r=kvark a=rukai

I don't see any reason why self.active should never be empty.

This fixes the validation errors but not the slowdown of #207
I presume the slowdown was causing self.active to clear.

Co-authored-by: Rukai <[email protected]>
@rukai
Contributor Author

rukai commented Jun 4, 2019

Either wgpu is giving me cursed memory that is insanely slow to read from, or Rust is now generating insanely slow code due to a change in wgpu.

I am aware of how little sense either of those options makes, but consider the code below:

                    assert_eq!(width as usize * height as usize, data_u32.data.len());
                    let instant = std::time::Instant::now();
                    let mut data_u8: Vec<u8> = vec!(0; width as usize * height as usize * 4);
                    for (i, value) in data_u32.data.iter().enumerate() {
                        data_u8[i * 4 + 0] = ((*value & 0x00FF0000) >> 16) as u8;
                        data_u8[i * 4 + 1] = ((*value & 0x0000FF00) >> 08) as u8;
                        data_u8[i * 4 + 2] = ((*value & 0x000000FF) >> 00) as u8;
                        data_u8[i * 4 + 3] = ((*value & 0xFF000000) >> 24) as u8;
                    }
                    println!("{:?}", instant.elapsed());

When it is in the wgpu callback, it takes 32ms to complete. Adding more data_u8[i * 4 + 2] = .... lines increases the time taken linearly.

If I move the same for loop into a separate project, it reports ~350µs.

@kvark
Member

kvark commented Jun 4, 2019 via email

@zakarumych

zakarumych commented Jun 4, 2019

Uncached memory is slow to read. What memory usage is specified for the allocation? Download memory should be cached and suitable for reads, while Upload usage avoids cached memory types.

BTW, if the built-in set of memory usage types is too limiting for wgpu's cases, the allocation function takes impl MemoryUsage, so one can make a custom memory usage type.

@zakarumych

zakarumych commented Jun 4, 2019

If the buffer was created for both read and write, then here:

if !desc.usage.intersects(Bu::MAP_READ | Bu::MAP_WRITE) {
  Muv::Data
} else if (Bu::MAP_WRITE | Bu::TRANSFER_SRC).contains(desc.usage) {
  Muv::Upload
} else if (Bu::MAP_READ | Bu::TRANSFER_DST).contains(desc.usage) {
  Muv::Download
} else {
  Muv::Dynamic
}

It will pick Upload over Download. It should be vice versa, I think.
And I should either add this to the documentation, or add a memory usage type for read-and-write.

@rukai
Contributor Author

rukai commented Jun 4, 2019

I just figured out I can reduce the time from 32ms to 1.3ms by calling .to_vec() on the slice I am given first, and then reading from the vec instead of the slice. This is suitable for my needs, I'm now bottlenecked elsewhere.

But I'm not reading much data, just a 400x300 image of u32 colors. So it might still be too slow in general?
Using an image size of 1024x768, the to_vec() method takes ~10ms.
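
For reference, a sketch of that workaround applied to the conversion loop from the earlier snippet (same variable names as before; the only change is that the mapped slice is copied into an ordinary Vec before the per-pixel loop runs):

// Copy out of the mapped memory once, then read from plain heap memory.
let copied: Vec<u32> = data_u32.data.to_vec();
let mut data_u8: Vec<u8> = Vec::with_capacity(copied.len() * 4);
for value in &copied {
    data_u8.push(((value & 0x00FF_0000) >> 16) as u8);
    data_u8.push(((value & 0x0000_FF00) >> 8) as u8);
    data_u8.push((value & 0x0000_00FF) as u8);
    data_u8.push(((value & 0xFF00_0000) >> 24) as u8);
}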

@rukai
Contributor Author

rukai commented Jun 4, 2019

Oh I get what omni-viral is saying now.
I tried swapping the order of

} else if (Bu::MAP_WRITE | Bu::TRANSFER_SRC).contains(desc.usage) {
  Muv::Upload

and

} else if (Bu::MAP_READ | Bu::TRANSFER_DST).contains(desc.usage) {
  Muv::Download

But it had no effect.

@zakarumych

zakarumych commented Jun 4, 2019

But it had no effect.

Without the to_vec thing?

@rukai
Contributor Author

rukai commented Jun 4, 2019

Yes, without to_vec it had no effect.

@kvark
Member

kvark commented Jun 4, 2019 via email

@zakarumych

zakarumych commented Jun 4, 2019

Dynamic is not really a great choice for BufferUsage::all(), as it also avoids cached memory, so reads are awful.

I suggest implementing rendy::memory::MemoryUsage for BufferUsage to take all of the set flags into account appropriately.

@rukai
Contributor Author

rukai commented Jun 4, 2019

Aha, changing my framebuffer to use wgpu::BufferUsage::MAP_READ | wgpu::BufferUsage::TRANSFER_DST and using the rendy master branch, it now completes in ~250µs. Much better!
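
For anyone hitting the same thing, a minimal sketch of the buffer creation after the change (field names assume the wgpu-rs API in use at the time; the buffer previously used BufferUsage::all()):

// Declare the readback buffer with only the usages it needs, so the allocator
// can place it in cached, download-friendly memory (MAP_READ | TRANSFER_DST
// matches rendy's Download usage in the snippet quoted above).
let framebuffer_out = device.create_buffer(&wgpu::BufferDescriptor {
    size: width as u64 * height as u64 * 4,
    usage: wgpu::BufferUsage::MAP_READ | wgpu::BufferUsage::TRANSFER_DST,
});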

The examples don't even use BufferUsage::all(), so there is nothing to change there; I must have just added it myself while initially trying to get things working.
Hopefully in the future people will be able to realize the cause themselves, because they will have just written the needless BufferUsage flags. For me it was harder to reason about, because it was only fast before due to a bug in wgpu, and then suddenly not fast.

I think this issue should be good to close, unless it should be kept open to track other issues that have been uncovered?

@kvark
Member

kvark commented Jun 4, 2019

Created #210 to follow up

@kvark kvark closed this as completed Jun 4, 2019