FLIP fluid simulation example #145
base: master
Conversation
[numthreads(WorkgroupSize, 1, 1)]
void updateNeighborFluidCells(uint32_t3 ID : SV_DispatchThreadID)
{
    uint tid = ID.x;
    int3 cIdx = flatIdxToCellIdx(tid, gridData.gridSize);

    uint thisCellMaterial = getCellMaterial(cellMaterialInBuffer[tid]);
    uint cellMaterial = 0;
    setCellMaterial(cellMaterial, thisCellMaterial);

    uint xpCm = cIdx.x == 0 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(-1, 0, 0), gridData.gridSize)]);
    setXPrevMaterial(cellMaterial, xpCm);

    uint xnCm = cIdx.x == gridData.gridSize.x - 1 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(1, 0, 0), gridData.gridSize)]);
    setXNextMaterial(cellMaterial, xnCm);

    uint ypCm = cIdx.y == 0 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, -1, 0), gridData.gridSize)]);
    setYPrevMaterial(cellMaterial, ypCm);

    uint ynCm = cIdx.y == gridData.gridSize.y - 1 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, 1, 0), gridData.gridSize)]);
    setYNextMaterial(cellMaterial, ynCm);

    uint zpCm = cIdx.z == 0 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, 0, -1), gridData.gridSize)]);
    setZPrevMaterial(cellMaterial, zpCm);

    uint znCm = cIdx.z == gridData.gridSize.z - 1 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, 0, 1), gridData.gridSize)]);
    setZNextMaterial(cellMaterial, znCm);
again, 3D dispatches!
What you're doing right now is using the worst way to map a 1D index to texel addresses, causing cache misses when sampling neighbours.
You don't benefit from the workgroup touching neighbouring cells and caching them, because you're doing 128 voxels in a straight-ish line, which means you're tapping 1170 distinct cell values, a ratio of over 9.14 distinct cells per thread.
If you were using a 512 workgroup you'd tap 4626 distinct cells, giving you a ratio of 9.04.
However, if you used an 8x8x8 workgroup in a 3D dispatch, you'd only tap 1000 distinct cells (a ratio of 1.95), roughly a 4.6x reduction in cache pressure!
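For reference, a minimal sketch of the 3D-dispatch version, assuming the host side dispatches ceil(gridSize / 8) groups per axis (the 8x8x8 size is the one suggested here):

// hedged sketch: with a 3D dispatch the thread ID already *is* the cell index,
// so the threads of one workgroup touch an 8x8x8 block of cells instead of a line
[numthreads(8, 8, 8)]
void updateNeighborFluidCells(uint32_t3 ID : SV_DispatchThreadID)
{
    const int3 cIdx = int3(ID);
    if (any(cIdx >= int3(gridData.gridSize)))
        return; // the grid size need not be a multiple of 8
    const uint tid = cellIdxToFlatIdx(cIdx, gridData.gridSize); // only needed for the flat buffers
    // ... the rest of the body stays exactly as above ...
}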
Also, then you could, you know... preload everything into a groupshared [10][10][10] array and sample from there.
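A hedged sketch of that preload, assuming the 8x8x8 workgroup / 3D dispatch from the previous comment (SV_GroupID and SV_GroupThreadID are added to the signature for it):

// the 8x8x8 workgroup cooperatively preloads a 10x10x10 tile (1-cell skirt) of cell materials
groupshared uint sCellMat[10][10][10];

[numthreads(8, 8, 8)]
void updateNeighborFluidCells(uint32_t3 GID : SV_GroupID, uint32_t3 LID : SV_GroupThreadID)
{
    const int3 tileOrigin = int3(GID) * 8 - 1; // grid-space origin of the 10x10x10 tile
    const uint flatLID = LID.x + 8 * (LID.y + 8 * LID.z);

    // 512 threads load 1000 cells, so most threads load twice
    for (uint i = flatLID; i < 10 * 10 * 10; i += 8 * 8 * 8)
    {
        const int3 local = int3(i % 10, (i / 10) % 10, i / 100);
        const int3 cell = tileOrigin + local;
        const bool outside = any(cell < 0) || any(cell >= int3(gridData.gridSize));
        sCellMat[local.z][local.y][local.x] = outside ? CM_SOLID
            : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cell, gridData.gridSize)]);
    }
    GroupMemoryBarrierWithGroupSync();

    // all 7 taps now come from shared memory, e.g. the centre and the -X neighbour:
    const int3 l = int3(LID) + 1; // +1 to skip the skirt
    uint cellMaterial = 0;
    setCellMaterial(cellMaterial, sCellMat[l.z][l.y][l.x]);
    setXPrevMaterial(cellMaterial, sCellMat[l.z][l.y][l.x - 1]);
    // ... setXNextMaterial / setY* / setZ* follow the same pattern as the original code above
}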
for (uint pid = idx.x; pid < idx.y; pid++)
{
    Particle p = particleBuffer[pid];

    float3 weight;
    weight.x = getWeight(p.position.xyz, posvx, gridData.gridInvCellSize);
    weight.y = getWeight(p.position.xyz, posvy, gridData.gridInvCellSize);
    weight.z = getWeight(p.position.xyz, posvz, gridData.gridInvCellSize);

    totalWeight += weight;
    totalVel += weight * p.velocity.xyz;
}
this has somewhat bad load balancing; there is another way to do this: ATOMICS.
Basically you'd need to clear the velocity buffer to 0 in some previous dispatch, and in updateFluidCells (since you're going through the particles anyway) you'd do an atomicAdd of the weighted velocity into the cell.
Why does this work?
- you're supposed to split the velocity buffer into mono-channel images anyway (3 images, one per component)
- emulated float atomic ops can be done via CAS loops on an R32_UINT underlying format
- there are also extensions for atomic float and atomic float16 (would have to check if we list them in SFeatures or SLimits of the physical device) if you feel like specializing with device traits
E.g. a CAS-loop atomic add:
float32_t oldValue = 0.f; // guess at the current value; corrected after the first failed CAS
float32_t expectedValue;  // declared outside the loop so the while-condition can see it
do
{
    expectedValue = oldValue;
    const float32_t newValue = oldValue + weightedVelocity;
    uint32_t oldBitpattern;
    InterlockedCompareExchange(velocity[component][cIdx], asuint(expectedValue), asuint(newValue), oldBitpattern);
    oldValue = asfloat(oldBitpattern);
} while (asuint(oldValue) != asuint(expectedValue)); // compare bit patterns so -0.0 vs +0.0 can't end the loop early
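For concreteness, a hedged sketch of what the scatter side in updateFluidCells could look like. The atomicAddF32 helper is assumed to wrap the CAS loop above (one R32_UINT buffer per velocity component), and the buffer names velocityWeightedX / weightX are hypothetical; getWeight, posvx and gridData are the ones already used in the gather loop.

// per particle, for every cell `cid` in the particle's kernel support
// (x component shown; y and z accumulate into their own mono-channel buffers)
const float wx = getWeight(p.position.xyz, posvx, gridData.gridInvCellSize);
atomicAddF32(velocityWeightedX, cid, wx * p.velocity.x); // weighted velocity (numerator)
atomicAddF32(weightX, cid, wx);                          // weight (denominator)

A later per-cell pass then divides the two sums, recovering the same weighted average the gather loop builds up.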
Then, if all the weighted accumulation happens in updateFluidCells, this dispatch only needs to enforce the boundary condition, so it can be kernel-fused with the updateNeighbour dispatch.
Yep, there should be device_traits for shaderBufferFloat16AtomicAdd, shaderBufferFloat32AtomicAdd and shaderImageFloat32AtomicAdd if you compile your code with ILogicalDevice
[[vk::binding(2, 1)]] RWStructuredBuffer<uint> globalHistograms;
[[vk::binding(3, 1)]] RWStructuredBuffer<uint> partitionHistogram;

groupshared uint localHistogram[NumSortBins];
Your local histogram was so small that you could have prefix-summed it in shared memory within the workgroup right away, just like our counting sort example.
Anyway, with a global sort over the whole image, that won't be possible anymore.
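A hedged sketch of such an in-workgroup scan (a plain Hillis-Steele pass; the actual counting sort example presumably uses Nabla's workgroup scan utilities instead of a hand-rolled loop). It assumes NumSortBins <= WorkgroupSize and that `lid` is the flattened SV_GroupThreadID:

GroupMemoryBarrierWithGroupSync(); // the bin counts in localHistogram must be final
for (uint stride = 1; stride < NumSortBins; stride <<= 1)
{
    uint partial = 0;
    if (lid < NumSortBins && lid >= stride)
        partial = localHistogram[lid - stride];
    GroupMemoryBarrierWithGroupSync();
    if (lid < NumSortBins)
        localHistogram[lid] += partial;
    GroupMemoryBarrierWithGroupSync();
}
// localHistogram[lid] now holds the inclusive prefix sum of the bin counts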
    float3 velocity = velocityFieldBuffer[cIdx].xyz;
    velocity += float3(0, -1, 0) * gravity * deltaTime;

    enforceBoundaryCondition(velocity, cellMaterialBuffer[c_id]);

    velocityFieldBuffer[cIdx].xyz = velocity;
}
This could have been kernel-fused with updateNeighbor.
[numthreads(WorkgroupSize, 1, 1)]
void setAxisCellMaterial(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    uint cellMaterial = cellMaterialBuffer[cid];

    uint this_cm = getCellMaterial(cellMaterial);
    uint xp_cm = getXPrevMaterial(cellMaterial);
    uint yp_cm = getYPrevMaterial(cellMaterial);
    uint zp_cm = getZPrevMaterial(cellMaterial);

    uint3 cellAxisType;
    cellAxisType.x =
        isSolidCell(this_cm) || isSolidCell(xp_cm) ? CM_SOLID :
        isFluidCell(this_cm) || isFluidCell(xp_cm) ? CM_FLUID :
        CM_AIR;
    cellAxisType.y =
        isSolidCell(this_cm) || isSolidCell(yp_cm) ? CM_SOLID :
        isFluidCell(this_cm) || isFluidCell(yp_cm) ? CM_FLUID :
        CM_AIR;
    cellAxisType.z =
        isSolidCell(this_cm) || isSolidCell(zp_cm) ? CM_SOLID :
        isFluidCell(this_cm) || isFluidCell(zp_cm) ? CM_FLUID :
        CM_AIR;

    uint3 cmAxisTypes = 0;
    setCellMaterial(cmAxisTypes, cellAxisType);

    axisCellMaterialOutBuffer[cid].xyz = cmAxisTypes;
}

[numthreads(WorkgroupSize, 1, 1)]
void setNeighborAxisCellMaterial(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    uint3 axisCm = (uint3)0;
    uint3 this_axiscm = getCellMaterial(axisCellMaterialInBuffer[cid].xyz);
    setCellMaterial(axisCm, this_axiscm);

    uint3 xp_axiscm = cellIdx.x == 0 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(-1, 0, 0), gridData.gridSize)].xyz);
    setXPrevMaterial(axisCm, xp_axiscm);

    uint3 xn_axiscm = cellIdx.x == gridData.gridSize.x - 1 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(1, 0, 0), gridData.gridSize)].xyz);
    setXNextMaterial(axisCm, xn_axiscm);

    uint3 yp_axiscm = cellIdx.y == 0 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, -1, 0), gridData.gridSize)].xyz);
    setYPrevMaterial(axisCm, yp_axiscm);

    uint3 yn_axiscm = cellIdx.y == gridData.gridSize.y - 1 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, 1, 0), gridData.gridSize)].xyz);
    setYNextMaterial(axisCm, yn_axiscm);

    uint3 zp_axiscm = cellIdx.z == 0 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, 0, -1), gridData.gridSize)].xyz);
    setZPrevMaterial(axisCm, zp_axiscm);

    uint3 zn_axiscm = cellIdx.z == gridData.gridSize.z - 1 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, 0, 1), gridData.gridSize)].xyz);
    setZNextMaterial(axisCm, zn_axiscm);

    axisCellMaterialOutBuffer[cid].xyz = axisCm;
}
This could be kernel-fused: use an 8x8x8 workgroup, preload 10x10x10 (a 1-cell skirt), update the axis cell in smem, and then do the neighbor-axis update on the inner 8x8x8 of the 10x10x10.
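A hedged sketch of that fusion, under the same 8x8x8-workgroup / 3D-dispatch assumption as the earlier comments; everything it calls is taken from the two kernels it would replace:

// fused setAxisCellMaterial + setNeighborAxisCellMaterial (sketch)
groupshared uint3 sAxisType[10][10][10]; // raw (unpacked) per-axis material for the tile

[numthreads(8, 8, 8)]
void setAxisCellMaterialFused(uint32_t3 GID : SV_GroupID, uint32_t3 LID : SV_GroupThreadID)
{
    const int3 tileOrigin = int3(GID) * 8 - 1;
    const uint flatLID = LID.x + 8 * (LID.y + 8 * LID.z);

    // phase 1: cooperatively compute the axis material for all 1000 tile cells into smem
    for (uint i = flatLID; i < 10 * 10 * 10; i += 8 * 8 * 8)
    {
        const int3 local = int3(i % 10, (i / 10) % 10, i / 100);
        const int3 cellIdx = tileOrigin + local;
        uint3 axisType = (uint3)CM_SOLID; // out-of-grid cells behave as solid
        if (all(cellIdx >= 0) && all(cellIdx < int3(gridData.gridSize)))
        {
            const uint cellMaterial = cellMaterialBuffer[cellIdxToFlatIdx(cellIdx, gridData.gridSize)];
            const uint this_cm = getCellMaterial(cellMaterial);
            const uint xp_cm = getXPrevMaterial(cellMaterial);
            const uint yp_cm = getYPrevMaterial(cellMaterial);
            const uint zp_cm = getZPrevMaterial(cellMaterial);
            axisType.x = isSolidCell(this_cm) || isSolidCell(xp_cm) ? CM_SOLID :
                         isFluidCell(this_cm) || isFluidCell(xp_cm) ? CM_FLUID : CM_AIR;
            axisType.y = isSolidCell(this_cm) || isSolidCell(yp_cm) ? CM_SOLID :
                         isFluidCell(this_cm) || isFluidCell(yp_cm) ? CM_FLUID : CM_AIR;
            axisType.z = isSolidCell(this_cm) || isSolidCell(zp_cm) ? CM_SOLID :
                         isFluidCell(this_cm) || isFluidCell(zp_cm) ? CM_FLUID : CM_AIR;
        }
        sAxisType[local.z][local.y][local.x] = axisType;
    }
    GroupMemoryBarrierWithGroupSync();

    // phase 2: only the inner 8x8x8 packs this cell plus its 6 neighbours from shared memory
    const int3 l = int3(LID) + 1;
    const int3 cellIdx = tileOrigin + l;
    if (any(cellIdx >= int3(gridData.gridSize)))
        return; // grid size not necessarily a multiple of 8
    uint3 axisCm = (uint3)0;
    setCellMaterial(axisCm, sAxisType[l.z][l.y][l.x]);
    setXPrevMaterial(axisCm, sAxisType[l.z][l.y][l.x - 1]);
    setXNextMaterial(axisCm, sAxisType[l.z][l.y][l.x + 1]);
    setYPrevMaterial(axisCm, sAxisType[l.z][l.y - 1][l.x]);
    setYNextMaterial(axisCm, sAxisType[l.z][l.y + 1][l.x]);
    setZPrevMaterial(axisCm, sAxisType[l.z - 1][l.y][l.x]);
    setZNextMaterial(axisCm, sAxisType[l.z + 1][l.y][l.x]);
    axisCellMaterialOutBuffer[cellIdxToFlatIdx(cellIdx, gridData.gridSize)].xyz = axisCm;
}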
    gridDiffusionOutBuffer[cid].xyz = velocity;
}

[numthreads(WorkgroupSize, 1, 1)]
void updateVelocity(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    float3 velocity = gridDiffusionInBuffer[cid].xyz;

    enforceBoundaryCondition(velocity, cellMaterialBuffer[cid]);

    velocityFieldBuffer[cellIdx].xyz = velocity;
I'm pretty sure this can be kernel-fused.
Also, why do you have so many copies of the velocity lying around?
I can count at least 4; do they all need to be live at the same time!?
void calculateNegativeDivergence(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    float3 param = (float3)gridData.gridInvCellSize;
    float3 velocity = velocityFieldBuffer[cellIdx].xyz;

    float divergence = 0;
    if (isFluidCell(getCellMaterial(cellMaterialBuffer[cid])))
    {
        int3 cell_xn = cellIdx + int3(1, 0, 0);
        uint cid_xn = cellIdxToFlatIdx(cell_xn, gridData.gridSize);
        divergence += param.x * ((cell_xn.x < gridData.gridSize.x ? velocityFieldBuffer[cell_xn].x : 0.0f) - velocity.x);

        int3 cell_yn = cellIdx + int3(0, 1, 0);
        uint cid_yn = cellIdxToFlatIdx(cell_yn, gridData.gridSize);
        divergence += param.y * ((cell_yn.y < gridData.gridSize.y ? velocityFieldBuffer[cell_yn].y : 0.0f) - velocity.y);

        int3 cell_zn = cellIdx + int3(0, 0, 1);
        uint cid_zn = cellIdxToFlatIdx(cell_zn, gridData.gridSize);
        divergence += param.z * ((cell_zn.z < gridData.gridSize.z ? velocityFieldBuffer[cell_zn].z : 0.0f) - velocity.z);
    }

    divergenceBuffer[cid] = divergence;
}
This can be kernel-fused with the updateVelocity and diffusion pipeline if you preload a 1-cell border around the 8x8x8 into shared memory and do some redundant work.
[numthreads(WorkgroupSize, 1, 1)]
void solvePressureSystem(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    float pressure = 0;

    uint cellMaterial = cellMaterialBuffer[cid];

    if (isFluidCell(getCellMaterial(cellMaterial)))
    {
        uint cid_xp = cellIdxToFlatIdx(cellIdx + int3(-1, 0, 0), gridData.gridSize);
        cid_xp = isSolidCell(getXPrevMaterial(cellMaterial)) ? cid : cid_xp;
        uint cid_xn = cellIdxToFlatIdx(cellIdx + int3(1, 0, 0), gridData.gridSize);
        cid_xn = isSolidCell(getXNextMaterial(cellMaterial)) ? cid : cid_xn;
        pressure += params.coeff1.x * (pressureInBuffer[cid_xp] + pressureInBuffer[cid_xn]);

        uint cid_yp = cellIdxToFlatIdx(cellIdx + int3(0, -1, 0), gridData.gridSize);
        cid_yp = isSolidCell(getYPrevMaterial(cellMaterial)) ? cid : cid_yp;
        uint cid_yn = cellIdxToFlatIdx(cellIdx + int3(0, 1, 0), gridData.gridSize);
        cid_yn = isSolidCell(getYNextMaterial(cellMaterial)) ? cid : cid_yn;
        pressure += params.coeff1.y * (pressureInBuffer[cid_yp] + pressureInBuffer[cid_yn]);

        uint cid_zp = cellIdxToFlatIdx(cellIdx + int3(0, 0, -1), gridData.gridSize);
        cid_zp = isSolidCell(getZPrevMaterial(cellMaterial)) ? cid : cid_zp;
        uint cid_zn = cellIdxToFlatIdx(cellIdx + int3(0, 0, 1), gridData.gridSize);
        cid_zn = isSolidCell(getZNextMaterial(cellMaterial)) ? cid : cid_zn;
        pressure += params.coeff1.z * (pressureInBuffer[cid_zp] + pressureInBuffer[cid_zn]);

        pressure += params.coeff1.w * divergenceBuffer[cid];
    }

    pressureOutBuffer[cid] = pressure;
}
An 8x8x8 workgroup has 512 threads.
Recent Nvidia GPUs can have 2, 3 or 4 of those resident on an SM at the same time.
Vulkan requires implementations to support at least 16 kB of shared memory per workgroup, but in practice all desktop GPUs let you use 32 kB.
It just so happens that the old NV GPU arch (Pascal), which can have at most 2 such workgroups co-resident, has at least 64 kB of smem per SM, so that's enough for 2 workgroups using 32 kB each, and newer GPUs have even more.
You are accessing:
- divergence scalar
- pressure scalar
- solid testing
32 kB is enough to store 5461 cells of this data if you use float16_t for the scalars (6 bytes per cell), and 3276 if you use float32_t (10 bytes per cell).
That means 17^3 or 14^3 tiles can be kept in shared memory.
Meaning you can do (17-8)/2 = 4 or (14-8)/2 = 3 iterations of pressure solving in a single dispatch.
- How is it possible to have only one copy of your buffers, without ping-ponging? Red-Black Ordering!
https://www.youtube.com/watch?app=desktop&v=giTZ89q-Bpk
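A hedged sketch of the core of such a fused multi-iteration solve. It assumes a 16^3 tile (a symmetric 4-cell halo around the workgroup's 8^3 cells, a bit below the 17^3 bound above) has already been cooperatively loaded, the same way as in the earlier preload sketches; sMat holds the unpacked cell material, flatLID is the flattened SV_GroupThreadID, and params.coeff1 / isFluidCell / isSolidCell are the ones from solvePressureSystem above:

// 16^3 tile of 16-bit data: (2 + 2 + 2) B * 4096 cells = 24 kB of groupshared
groupshared float16_t sPressure[16][16][16];
groupshared float16_t sDivergence[16][16][16];
groupshared uint16_t  sMat[16][16][16];

// four in-place red-black sweeps; each colour only reads the other colour (plus its own
// old centre value for solid mirroring), so no pressureIn/pressureOut ping-pong is needed
for (uint iter = 0; iter < 4; iter++)
{
    for (uint colour = 0; colour < 2; colour++)
    {
        for (uint i = flatLID; i < 16 * 16 * 16; i += 8 * 8 * 8)
        {
            const int3 l = int3(i % 16, (i / 16) % 16, i / 256);
            const bool skip = any(l == 0) || any(l == 15) ||        // outermost ring has no loaded neighbours
                              (uint(l.x + l.y + l.z) & 1u) != colour ||
                              !isFluidCell((uint)sMat[l.z][l.y][l.x]);
            if (!skip)
            {
                const float16_t centre = sPressure[l.z][l.y][l.x];
                // a solid neighbour mirrors the centre value, like the `? cid : cid_xp` selects above
                const float16_t xp = isSolidCell((uint)sMat[l.z][l.y][l.x - 1]) ? centre : sPressure[l.z][l.y][l.x - 1];
                const float16_t xn = isSolidCell((uint)sMat[l.z][l.y][l.x + 1]) ? centre : sPressure[l.z][l.y][l.x + 1];
                const float16_t yp = isSolidCell((uint)sMat[l.z][l.y - 1][l.x]) ? centre : sPressure[l.z][l.y - 1][l.x];
                const float16_t yn = isSolidCell((uint)sMat[l.z][l.y + 1][l.x]) ? centre : sPressure[l.z][l.y + 1][l.x];
                const float16_t zp = isSolidCell((uint)sMat[l.z - 1][l.y][l.x]) ? centre : sPressure[l.z - 1][l.y][l.x];
                const float16_t zn = isSolidCell((uint)sMat[l.z + 1][l.y][l.x]) ? centre : sPressure[l.z + 1][l.y][l.x];
                sPressure[l.z][l.y][l.x] =
                    float16_t(params.coeff1.x) * (xp + xn) +
                    float16_t(params.coeff1.y) * (yp + yn) +
                    float16_t(params.coeff1.z) * (zp + zn) +
                    float16_t(params.coeff1.w) * sDivergence[l.z][l.y][l.x];
            }
        }
        GroupMemoryBarrierWithGroupSync();
    }
}
// the halo goes progressively stale, so afterwards only the inner 8^3 cells are still valid;
// write just those back to the single pressure buffer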
What I want you to do:
The shared memory and kernel fusion improvements we can leave for someone else as their recruitment task in the future, unless you find some of the fusions as trivial as I do. Nsight profiling might be nice to do before/after the changes to see if we're winning and by how much. P.S. The only time you should use a 1D dispatch is when you're going over the particle list, so you can map
Trial from the free task list for contributing to Nabla.
Performs a real-time FLIP fluid simulation using compute shaders.