-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Work on property pool HLSL impl #649
base: master
Are you sure you want to change the base?
Conversation
uint64_t endOffset; | ||
}; | ||
|
||
NBL_CONSTEXPR uint32_t MaxPropertiesPerDispatch = 128; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any reason to keep this around anymore?
// Define the range of invocations (X axis) that will be transfered over in this dispatch | ||
// May be sectioned off in the case of overflow or any other situation that doesn't allow | ||
// for a full transfer | ||
uint64_t beginOffset; | ||
uint64_t endOffset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be useful to make it clear we're counting in DWORDs or shorts (if you want to do 16bit transfer atoms instead)
template<bool Fill, bool SrcIndexIota, bool DstIndexIota> | ||
struct TransferLoopPermutationDstIota | ||
{ | ||
void copyLoop(uint baseInvocationIndex, uint propertyId, TransferRequest transferRequest, uint dispatchSize) | ||
{ | ||
if (transferRequest.srcIndexSizeLog2 == 0) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 0> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
else if (transferRequest.srcIndexSizeLog2 == 1) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 1> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
else if (transferRequest.srcIndexSizeLog2 == 2) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 2> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
else /*if (transferRequest.srcIndexSizeLog2 == 3)*/ { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 3> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
} | ||
}; | ||
|
||
template<bool Fill, bool SrcIndexIota> | ||
struct TransferLoopPermutationSrcIota | ||
{ | ||
void copyLoop(uint baseInvocationIndex, uint propertyId, TransferRequest transferRequest, uint dispatchSize) | ||
{ | ||
bool dstIota = transferRequest.dstIndexAddr == 0; | ||
if (dstIota) { TransferLoopPermutationDstIota<Fill, SrcIndexIota, true> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
else { TransferLoopPermutationDstIota<Fill, SrcIndexIota, false> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); } | ||
} | ||
}; | ||
|
||
template<bool Fill> | ||
struct TransferLoopPermutationFill | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only use structs instead of templated functions when you need partial specialization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The struct functor only makes sense if:
- you do a partial specialization, i.e.
template<typename ArbitraryType> struct MyFunctor<true,ArbitraryType>
because functions can be only fully specialized - you need to pass the functor as a template arg/lambda because HLSL202x doesn't allow function pointers/references or you want a stateful functor
template<typename Accessor, typename Compare>
uint32_t find_first(inout Accessor accessor, const Compare comparator);
if neither of the above applies, just use a templated function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok, I think your original comment was the wrong way around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok, I think your original comment was the wrong way around
"only use structs instead of templated functions when you need partial specialization"
- you don't need partial specialization
- you're using structs
- but you only use structs when you have partial specialization
- dont use structs here.
static inline constexpr auto invalid = PropertyAddressAllocator::invalid_address; | ||
|
||
static inline constexpr uint64_t invalid = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when did we agree to change the invalid
from the AddressAllocator invalid (0xdeadbeefu) to 0?
// TODO: Reuse asset manager from elsewhere? | ||
auto assetManager = core::make_smart_refctd_ptr<asset::IAssetManager>(core::smart_refctd_ptr<system::ISystem>(system)); | ||
|
||
auto loadShader = [&](const char* path) | ||
{ | ||
asset::IAssetLoader::SAssetLoadParams params = {}; | ||
auto assetBundle = assetManager->getAsset(path, params); | ||
auto assets = assetBundle.getContents(); | ||
assert(!assets.empty()); | ||
|
||
auto cpuShader = asset::IAsset::castDown<asset::ICPUShader>(assets[0]); | ||
auto shader = m_device->createShader(cpuShader.get()); | ||
return shader; | ||
}; | ||
auto shader = loadShader("../../../include/nbl/builtin/hlsl/property_pool/copy.comp.hlsl"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather use system->openFile
and use the nbl/builtin/hlsl/property_pool/copy.comp.hlsl
path directly toi opena memory mapped file and create IGPUShader
from it directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look at the old loadBuiltinData
code
transferRequest.srcIndexAddr = srcRequest->srcAddressesOffset ? addressBufferDeviceAddr + srcRequest->srcAddressesOffset : 0; | ||
transferRequest.dstIndexAddr = srcRequest->dstAddressesOffset ? addressBufferDeviceAddr + srcRequest->dstAddressesOffset : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I told you explicitly to get rid of the addresses
buffer, and just have each transferRequest supply the correctly offsetted BDA or SBufferBinding instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given how you're doing srcAddr
and dstAddr
, the SBufferBinding
route is preferrable
uint32_t maxScratchSize = MaxPropertiesPerDispatch * sizeof(nbl::hlsl::property_pools::TransferRequest); | ||
if (scratch.offset + maxScratchSize > scratch.buffer->getSize()) | ||
logger.log("CPropertyPoolHandler: The scratch buffer binding provided might not be big enough in the worst case! (Scratch buffer size: %i Max scratch size: %i)", | ||
system::ILogger::ELL_WARNING, | ||
scratch.buffer->getSize() - scratch.offset, | ||
maxScratchSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd validate better, also right now your code will choke/crash on a scratch buffer thats too small, you're still using MaxPropertiesPerDispatch
to figure out the numberOfPasses
pushConstants.beginOffset = baseDWORD; | ||
pushConstants.endOffset = endDWORD; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is why naming your variables matters, Offset == ByteOffset, but we're clearly counting in DWORDs?
m_indexToAddr = reinterpret_cast<uint32_t*>(reinterpret_cast<uint8_t*>(reserved)+getReservedSize(capacity)); | ||
m_indexToAddr = reinterpret_cast<uint64_t*>(reinterpret_cast<uint8_t*>(reserved)+getReservedSize(capacity)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can't touch this up without making the reserved mem bigger!
// TODO: instead use some sort of replace function for getting optimal size? | ||
[numthreads(512,1,1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already wrote on discord to codegen a 5 line compute shader with main
in the C++ and leave copy.hlsl
as stage agnostic
transferRequest.srcAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr,8); | ||
transferRequest.dstAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr + sizeof(uint64_t),8); | ||
transferRequest.srcIndexAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr + sizeof(uint64_t) * 2,8); | ||
transferRequest.dstIndexAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr + sizeof(uint64_t) * 3,8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a wrapper for vk::RawBufferLoad
and Store
in nbl::hlsl::legacy
which uses our type traits to default the alignment
// Loading transfer request from the pointer (can't use struct | ||
// with BDA on HLSL SPIRV) | ||
static TransferRequest TransferRequest::newFromAddress(const uint64_t transferCmdAddr) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep it with the struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The struct is shared with c++ code, so i wouldn't be able to use vk::rawbufferread; I could take the 64 bit value though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use #ifndef __HLSL_VERSION
in the impl of the method
// TODO: instead use some sort of replace function for getting optimal size? | ||
NBL_CONSTEXPR uint32_t OptimalDispatchSize = 256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use the device JIT to query the max compute dispatch size, I'd round it down to nearest PoT though, so the divisions aren't expensive
c00ee34
to
7ac728b
Compare
Description
Implementing
CPropertyPoolHandler
andCPropertyPool
in HLSL, using direct buffer address instead of allocating descriptors sets for buffers.Notes about impl:
Testing
TODO list: