FLIP fluid simulation example #145

Open
keptsecret wants to merge 84 commits into master

Conversation

keptsecret
Contributor

Trial from the free task list for contributing to Nabla.

Performs a real-time FLIP fluid simulation using compute shaders.

Comment on lines 46 to 72
[numthreads(WorkgroupSize, 1, 1)]
void updateNeighborFluidCells(uint32_t3 ID : SV_DispatchThreadID)
{
    uint tid = ID.x;
    int3 cIdx = flatIdxToCellIdx(tid, gridData.gridSize);

    uint thisCellMaterial = getCellMaterial(cellMaterialInBuffer[tid]);
    uint cellMaterial = 0;
    setCellMaterial(cellMaterial, thisCellMaterial);

    uint xpCm = cIdx.x == 0 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(-1, 0, 0), gridData.gridSize)]);
    setXPrevMaterial(cellMaterial, xpCm);

    uint xnCm = cIdx.x == gridData.gridSize.x - 1 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(1, 0, 0), gridData.gridSize)]);
    setXNextMaterial(cellMaterial, xnCm);

    uint ypCm = cIdx.y == 0 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, -1, 0), gridData.gridSize)]);
    setYPrevMaterial(cellMaterial, ypCm);

    uint ynCm = cIdx.y == gridData.gridSize.y - 1 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, 1, 0), gridData.gridSize)]);
    setYNextMaterial(cellMaterial, ynCm);

    uint zpCm = cIdx.z == 0 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, 0, -1), gridData.gridSize)]);
    setZPrevMaterial(cellMaterial, zpCm);

    uint znCm = cIdx.z == gridData.gridSize.z - 1 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, 0, 1), gridData.gridSize)]);
    setZNextMaterial(cellMaterial, znCm);


again, 3D dispatches!

What you're doing right now is the worst way to map a 1D index to texel addresses, and it causes cache misses when sampling neighbours.

You don't benefit from the workgroup touching neighbouring cells and caching, because you're doing 128 voxels in a straight-ish line, which means you're tapping 1170 distinct cell values, a ratio of over 9.14.

If you were using a 512 workgroup then you'd tap 4626 distinct cells, giving you a ratio of 9.04.

However, if you used an 8x8x8 workgroup in a 3D dispatch, you'd only tap 1000 distinct cells, giving you a 350% reduction in cache pressure!
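To illustrate (a sketch only, not code from this PR; WORKGROUP_DIM and the bounds check are assumptions): with an 8x8x8 workgroup the dispatch thread ID can be used as the cell index directly, and the flat-index maths disappears:

// Hypothetical 3D-dispatch version, dispatched as ceil(gridSize / 8) groups per axis.
#define WORKGROUP_DIM 8

[numthreads(WORKGROUP_DIM, WORKGROUP_DIM, WORKGROUP_DIM)]
void updateNeighborFluidCells(uint32_t3 ID : SV_DispatchThreadID)
{
    // the thread ID *is* the cell index, no flatIdxToCellIdx (integer div/mod) needed
    const int3 cIdx = int3(ID);
    if (any(cIdx >= gridData.gridSize))
        return;

    uint cellMaterial = 0;
    setCellMaterial(cellMaterial, getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx, gridData.gridSize)]));
    // ...the six neighbour taps stay as before, but the cells a workgroup touches now
    // form an 8x8x8 block, so neighbour reads hit data that is already warm in cache
}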


Also, you could then preload everything into a groupshared [10][10][10] tile and sample from there.
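A sketch of that preload (the sCellMaterial name and the cooperative-load loop are made up for illustration, assuming the 8x8x8 / 3D-dispatch layout above):

groupshared uint sCellMaterial[10][10][10]; // 8x8x8 block plus a 1-cell skirt

[numthreads(8, 8, 8)]
void updateNeighborFluidCells(uint32_t3 LID : SV_GroupThreadID, uint32_t3 GID : SV_GroupID)
{
    // cooperative load: 512 threads fill 1000 slots, so some threads load two cells
    const int3 tileOrigin = int3(GID) * 8 - 1;
    for (uint i = LID.x + 8 * (LID.y + 8 * LID.z); i < 1000; i += 512)
    {
        const int3 local = int3(i % 10, (i / 10) % 10, i / 100);
        const int3 cell = tileOrigin + local;
        const bool outside = any(cell < 0) || any(cell >= gridData.gridSize);
        sCellMaterial[local.z][local.y][local.x] =
            outside ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cell, gridData.gridSize)]);
    }
    GroupMemoryBarrierWithGroupSync();

    // every neighbour tap is now a shared-memory read, e.g. the -X neighbour:
    const int3 l = int3(LID) + 1; // +1 for the skirt offset
    const uint xpCm = sCellMaterial[l.z][l.y][l.x - 1];
    // ...rest of the packing as before
}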

Comment on lines 98 to 109
for (uint pid = idx.x; pid < idx.y; pid++)
{
    Particle p = particleBuffer[pid];

    float3 weight;
    weight.x = getWeight(p.position.xyz, posvx, gridData.gridInvCellSize);
    weight.y = getWeight(p.position.xyz, posvy, gridData.gridInvCellSize);
    weight.z = getWeight(p.position.xyz, posvz, gridData.gridInvCellSize);

    totalWeight += weight;
    totalVel += weight * p.velocity.xyz;
}


this has somewhat bad load balancing; there is another way to do this: ATOMICS

Basically you'd need to clear the velocity buffer to 0 in some previous dispatch, and in updateFluidCells (since you're going through particles anyway) you'd do an atomicAdd of the weighted velocity to the cell.

Why does this work?

  • you're supposed to split the velocity buffer into mono-channel anyway (3 images, one per component)
  • emulated float atomic ops can be done via CAS loops on an R32_UINT underlying format
  • there are also extensions for atomic float and atomic float16 (would have to check if we list them in SFeatures or SLimits of the physical device) if you feel like specializing with device traits

E.g. a CAS-loop atomic add:

float32_t expectedValue;
float32_t oldValue = 0.f; // optimistic guess at the value currently in the texel
do
{
    expectedValue = oldValue;
    const float32_t newValue = oldValue + weightedVelocity;
    uint32_t oldBitpattern;
    InterlockedCompareExchange(velocity[component][cIdx], asuint(expectedValue), asuint(newValue), oldBitpattern);
    oldValue = asfloat(oldBitpattern);
} while (oldValue != expectedValue); // someone else won the race, retry against their value
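For context, the scatter side in updateFluidCells could then look roughly like this (a sketch only; atomicAddVelocity/atomicAddWeight wrap the CAS loop above and getVelocityFacePos is an assumed helper, none of these exist in the PR):

// one thread per particle
const Particle p = particleBuffer[particleIx];
const int3 cIdx = int3(floor(p.position.xyz * gridData.gridInvCellSize)); // assumes the grid origin is at 0

[unroll]
for (uint component = 0; component < 3; component++)
{
    // position of this cell's staggered face sample on the current axis (posvx/posvy/posvz in the gather version)
    const float3 facePos = getVelocityFacePos(cIdx, component, gridData);
    const float w = getWeight(p.position.xyz, facePos, gridData.gridInvCellSize);
    atomicAddVelocity(component, cIdx, w * p.velocity[component]);
    atomicAddWeight(component, cIdx, w); // normalize velocity by the accumulated weight in a later pass
    // a full P2G scatter would also add to the neighbouring face samples the kernel overlaps
}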


Then, if all the weighted accumulation happens in updateFluidCells, this dispatch only needs to enforce the boundary condition, so it can be kernel-fused with the updateNeighbor dispatch.


Yep, there should be device_traits for shaderBufferFloat16AtomicAdd, shaderBufferFloat32AtomicAdd and shaderImageFloat32AtomicAdd if you compile your code with ILogicalDevice.

[[vk::binding(2, 1)]] RWStructuredBuffer<uint> globalHistograms;
[[vk::binding(3, 1)]] RWStructuredBuffer<uint> partitionHistogram;

groupshared uint localHistogram[NumSortBins];


your local histogram was so small that you could have prefix-summed it in shared memory with a workgroup right away, just like our counting sort example.

Anyway, with a global sort over the whole image, that won't be possible anymore.
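For reference, the in-workgroup version would look roughly like this (a sketch only, assuming WorkgroupSize >= NumSortBins; a production version would use workgroup/subgroup scan primitives rather than this naive loop):

// in-place Hillis-Steele inclusive scan over localHistogram, entirely in shared memory
void prefixSumLocalHistogram(const uint tid)
{
    GroupMemoryBarrierWithGroupSync(); // make sure the histogram is fully built
    for (uint offset = 1; offset < NumSortBins; offset <<= 1)
    {
        const uint partial = (tid >= offset && tid < NumSortBins) ? localHistogram[tid - offset] : 0;
        GroupMemoryBarrierWithGroupSync(); // everyone reads before anyone writes
        if (tid < NumSortBins)
            localHistogram[tid] += partial;
        GroupMemoryBarrierWithGroupSync(); // writes visible before the next round
    }
    // localHistogram[b] now holds the number of keys in bins 0..b (inclusive)
}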

Comment on lines 22 to 28
    float3 velocity = velocityFieldBuffer[cIdx].xyz;
    velocity += float3(0, -1, 0) * gravity * deltaTime;

    enforceBoundaryCondition(velocity, cellMaterialBuffer[c_id]);

    velocityFieldBuffer[cIdx].xyz = velocity;
}


this could have been kernel-fused with updateNeighbor

Comment on lines 26 to 88
[numthreads(WorkgroupSize, 1, 1)]
void setAxisCellMaterial(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    uint cellMaterial = cellMaterialBuffer[cid];

    uint this_cm = getCellMaterial(cellMaterial);
    uint xp_cm = getXPrevMaterial(cellMaterial);
    uint yp_cm = getYPrevMaterial(cellMaterial);
    uint zp_cm = getZPrevMaterial(cellMaterial);

    uint3 cellAxisType;
    cellAxisType.x =
        isSolidCell(this_cm) || isSolidCell(xp_cm) ? CM_SOLID :
        isFluidCell(this_cm) || isFluidCell(xp_cm) ? CM_FLUID :
        CM_AIR;
    cellAxisType.y =
        isSolidCell(this_cm) || isSolidCell(yp_cm) ? CM_SOLID :
        isFluidCell(this_cm) || isFluidCell(yp_cm) ? CM_FLUID :
        CM_AIR;
    cellAxisType.z =
        isSolidCell(this_cm) || isSolidCell(zp_cm) ? CM_SOLID :
        isFluidCell(this_cm) || isFluidCell(zp_cm) ? CM_FLUID :
        CM_AIR;

    uint3 cmAxisTypes = 0;
    setCellMaterial(cmAxisTypes, cellAxisType);

    axisCellMaterialOutBuffer[cid].xyz = cmAxisTypes;
}

[numthreads(WorkgroupSize, 1, 1)]
void setNeighborAxisCellMaterial(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    uint3 axisCm = (uint3)0;
    uint3 this_axiscm = getCellMaterial(axisCellMaterialInBuffer[cid].xyz);
    setCellMaterial(axisCm, this_axiscm);

    uint3 xp_axiscm = cellIdx.x == 0 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(-1, 0, 0), gridData.gridSize)].xyz);
    setXPrevMaterial(axisCm, xp_axiscm);

    uint3 xn_axiscm = cellIdx.x == gridData.gridSize.x - 1 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(1, 0, 0), gridData.gridSize)].xyz);
    setXNextMaterial(axisCm, xn_axiscm);

    uint3 yp_axiscm = cellIdx.y == 0 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, -1, 0), gridData.gridSize)].xyz);
    setYPrevMaterial(axisCm, yp_axiscm);

    uint3 yn_axiscm = cellIdx.y == gridData.gridSize.y - 1 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, 1, 0), gridData.gridSize)].xyz);
    setYNextMaterial(axisCm, yn_axiscm);

    uint3 zp_axiscm = cellIdx.z == 0 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, 0, -1), gridData.gridSize)].xyz);
    setZPrevMaterial(axisCm, zp_axiscm);

    uint3 zn_axiscm = cellIdx.z == gridData.gridSize.z - 1 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, 0, 1), gridData.gridSize)].xyz);
    setZNextMaterial(axisCm, zn_axiscm);

    axisCellMaterialOutBuffer[cid].xyz = axisCm;
}


this could be kernel fused: if you used an 8x8x8 workgroup, preloaded 10x10x10 (a 1-cell skirt), updated the axis cell materials in smem, and then did the neighbor-axis update on the inner 8x8x8 of the 10x10x10

See https://youtu.be/Ol_sHFVXvC0?si=RrP91pPpMzODMEDi&t=1585
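Structurally, that fusion could look something like this (sAxisCm and computeAxisMaterial are placeholder names; the cooperative load follows the same pattern as the skirt-preload sketch further up):

groupshared uint3 sAxisCm[10][10][10]; // unpacked per-axis materials for the 8x8x8 block + 1-cell skirt

[numthreads(8, 8, 8)]
void setAxisAndNeighborAxisCellMaterial(uint32_t3 LID : SV_GroupThreadID, uint32_t3 GID : SV_GroupID)
{
    const int3 tileOrigin = int3(GID) * 8 - 1;

    // phase 1: each thread runs the setAxisCellMaterial logic for one or two cells of the
    // 10x10x10 tile (512 threads, 1000 cells); computeAxisMaterial returns the raw per-axis
    // CM_* values (out-of-grid cells treated as CM_SOLID, as before)
    for (uint i = LID.x + 8 * (LID.y + 8 * LID.z); i < 1000; i += 512)
    {
        const int3 local = int3(i % 10, (i / 10) % 10, i / 100);
        sAxisCm[local.z][local.y][local.x] = computeAxisMaterial(tileOrigin + local);
    }
    GroupMemoryBarrierWithGroupSync();

    // phase 2: the setNeighborAxisCellMaterial gather runs only for the inner 8x8x8,
    // reading the six neighbours straight out of sAxisCm instead of global memory
    const int3 l = int3(LID) + 1;
    uint3 axisCm = (uint3)0;
    setCellMaterial(axisCm, sAxisCm[l.z][l.y][l.x]);
    setXPrevMaterial(axisCm, sAxisCm[l.z][l.y][l.x - 1]);
    setXNextMaterial(axisCm, sAxisCm[l.z][l.y][l.x + 1]);
    setYPrevMaterial(axisCm, sAxisCm[l.z][l.y - 1][l.x]);
    setYNextMaterial(axisCm, sAxisCm[l.z][l.y + 1][l.x]);
    setZPrevMaterial(axisCm, sAxisCm[l.z - 1][l.y][l.x]);
    setZNextMaterial(axisCm, sAxisCm[l.z + 1][l.y][l.x]);

    const uint cid = cellIdxToFlatIdx(tileOrigin + 1 + int3(LID), gridData.gridSize);
    axisCellMaterialOutBuffer[cid].xyz = axisCm;
}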

Comment on lines 128 to 141
    gridDiffusionOutBuffer[cid].xyz = velocity;
}

[numthreads(WorkgroupSize, 1, 1)]
void updateVelocity(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    float3 velocity = gridDiffusionInBuffer[cid].xyz;

    enforceBoundaryCondition(velocity, cellMaterialBuffer[cid]);

    velocityFieldBuffer[cellIdx].xyz = velocity;


I'm pretty sure this can be kernel fused.

Also, why do you have so many copies of velocity lying around?
I can count at least 4; do they all need to be live at the same time!?

Comment on lines 31 to 56
void calculateNegativeDivergence(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    float3 param = (float3)gridData.gridInvCellSize;
    float3 velocity = velocityFieldBuffer[cellIdx].xyz;

    float divergence = 0;
    if (isFluidCell(getCellMaterial(cellMaterialBuffer[cid])))
    {
        int3 cell_xn = cellIdx + int3(1, 0, 0);
        uint cid_xn = cellIdxToFlatIdx(cell_xn, gridData.gridSize);
        divergence += param.x * ((cell_xn.x < gridData.gridSize.x ? velocityFieldBuffer[cell_xn].x : 0.0f) - velocity.x);

        int3 cell_yn = cellIdx + int3(0, 1, 0);
        uint cid_yn = cellIdxToFlatIdx(cell_yn, gridData.gridSize);
        divergence += param.y * ((cell_yn.y < gridData.gridSize.y ? velocityFieldBuffer[cell_yn].y : 0.0f) - velocity.y);

        int3 cell_zn = cellIdx + int3(0, 0, 1);
        uint cid_zn = cellIdxToFlatIdx(cell_zn, gridData.gridSize);
        divergence += param.z * ((cell_zn.z < gridData.gridSize.z ? velocityFieldBuffer[cell_zn].z : 0.0f) - velocity.z);
    }

    divergenceBuffer[cid] = divergence;
}


this can be kernel fused with the updateVelocity and diffusion pipeline if you preload a 1-cell border around the 8x8x8 into shared memory and do some redundant work

Comment on lines 58 to 92
[numthreads(WorkgroupSize, 1, 1)]
void solvePressureSystem(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    float pressure = 0;

    uint cellMaterial = cellMaterialBuffer[cid];

    if (isFluidCell(getCellMaterial(cellMaterial)))
    {
        uint cid_xp = cellIdxToFlatIdx(cellIdx + int3(-1, 0, 0), gridData.gridSize);
        cid_xp = isSolidCell(getXPrevMaterial(cellMaterial)) ? cid : cid_xp;
        uint cid_xn = cellIdxToFlatIdx(cellIdx + int3(1, 0, 0), gridData.gridSize);
        cid_xn = isSolidCell(getXNextMaterial(cellMaterial)) ? cid : cid_xn;
        pressure += params.coeff1.x * (pressureInBuffer[cid_xp] + pressureInBuffer[cid_xn]);

        uint cid_yp = cellIdxToFlatIdx(cellIdx + int3(0, -1, 0), gridData.gridSize);
        cid_yp = isSolidCell(getYPrevMaterial(cellMaterial)) ? cid : cid_yp;
        uint cid_yn = cellIdxToFlatIdx(cellIdx + int3(0, 1, 0), gridData.gridSize);
        cid_yn = isSolidCell(getYNextMaterial(cellMaterial)) ? cid : cid_yn;
        pressure += params.coeff1.y * (pressureInBuffer[cid_yp] + pressureInBuffer[cid_yn]);

        uint cid_zp = cellIdxToFlatIdx(cellIdx + int3(0, 0, -1), gridData.gridSize);
        cid_zp = isSolidCell(getZPrevMaterial(cellMaterial)) ? cid : cid_zp;
        uint cid_zn = cellIdxToFlatIdx(cellIdx + int3(0, 0, 1), gridData.gridSize);
        cid_zn = isSolidCell(getZNextMaterial(cellMaterial)) ? cid : cid_zn;
        pressure += params.coeff1.z * (pressureInBuffer[cid_zp] + pressureInBuffer[cid_zn]);

        pressure += params.coeff1.w * divergenceBuffer[cid];
    }

    pressureOutBuffer[cid] = pressure;
}


an 8x8x8 dispatch has 512 threads.

Recent Nvidia GPUs can have 2, 3 or 4 of those resident on an SM at the same time.

Vulkan requires implementations to support at least 16 KB of shared memory per workgroup, but in practice all desktop GPUs let you use 32 KB.

It just so happens that the old NV GPU arch (Pascal), which can have at most 2 workgroups co-resident, has at least 64 KB of smem, so enough to support 2 workgroups using 32 KB each, and newer GPUs have even more.

You are accessing:

  • the divergence scalar
  • the pressure scalar
  • solid testing

32 KB is enough to store 5461 cells of this data if you use float16_t, and 3276 if you use float32_t for the scalars.

That means 17^3 or 14^3 grids are possible to keep in shared memory.

Meaning you can do (17-8)/2=4 or (14-8)/2=3 iterations of pressure solving in a single dispatch.
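To sketch the idea (under the float16_t / 17^3 assumption above; loadPressureTile, relaxValidRegion and storeInnerPressure are placeholder helpers, not existing code):

groupshared float16_t sPressure[17][17][17];
groupshared float16_t sDivergence[17][17][17];
groupshared uint16_t  sMaterial[17][17][17]; // packed this-cell + neighbour solidity, ~6 bytes/cell total

[numthreads(8, 8, 8)]
void solvePressureSystem(uint32_t3 LID : SV_GroupThreadID, uint32_t3 GID : SV_GroupID)
{
    loadPressureTile(GID, LID); // cooperative load of the inner 8x8x8 plus the 4-cell skirt
    GroupMemoryBarrierWithGroupSync();

    // each Jacobi iteration invalidates one cell layer on every side of the tile,
    // so after 4 iterations the inner 8x8x8 is still exact
    [unroll]
    for (uint iter = 0; iter < 4; iter++)
    {
        // relax every still-valid cell of the tile; red-black ordering (or a second
        // smem copy) avoids read-after-write hazards within the iteration
        relaxValidRegion(LID, iter);
        GroupMemoryBarrierWithGroupSync();
    }

    storeInnerPressure(GID, LID); // only the inner 8x8x8 goes back to pressureOutBuffer
}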

@devshgraphicsprogramming
Member

devshgraphicsprogramming commented Oct 12, 2024

What I want you to do:

  1. Share Descriptor Sets and Layouts between related dispatches (especially the ones I said could be kernel fused) so there are fewer of them
  2. Don't transition your 3D images back and forth between GENERAL and READ_ONLY
  3. Make sure everything that's a "grid" is a 3D Image (one image per vector component, RGBA formats waste the alpha) in a descriptor array binding and not a Buffer
  4. Use 3D dispatches and reduce the 1D -> 3D address conversions (the ones with integer divisions and modulo especially shouldn't happen; you will never need to go cell -> particle)
  5. Use float atomics (you can require the feature) to accumulate weighted particle velocities and get rid of the particle sorts and per-grid-cell particle lists
  6. Move from SSBO and Dynamic UBO to BDA and make sure to use SoA for particles
  7. Cut down/eliminate useless velocity buffer copies (you have about 4 floating about) which don't need to be live at the same time
  8. See if you can leverage red-black ordering for some of your grid updates so you don't need an in and out buffer (especially the pressure solver); see the sketch after this list
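For point 8, a minimal sketch of red-black (checkerboard) relaxation (pushConstants.phase and the single in-place pressureBuffer are assumptions for illustration): each half-sweep only touches cells of one colour, and since all six neighbours of such a cell have the other colour, reads and writes never race and the in/out ping-pong buffers go away.

// dispatched twice per iteration: once with phase == 0 (red), once with phase == 1 (black)
[numthreads(8, 8, 8)]
void relaxPressureRedBlack(uint32_t3 ID : SV_DispatchThreadID)
{
    const int3 cellIdx = int3(ID);
    if (any(cellIdx >= gridData.gridSize))
        return;

    // checkerboard colour of this cell
    if (((cellIdx.x + cellIdx.y + cellIdx.z) & 1) != pushConstants.phase)
        return;

    const uint cid = cellIdxToFlatIdx(cellIdx, gridData.gridSize);
    const uint cellMaterial = cellMaterialBuffer[cid];
    if (!isFluidCell(getCellMaterial(cellMaterial)))
        return; // non-fluid cells can be zeroed once before the solve instead of every iteration

    float pressure = params.coeff1.w * divergenceBuffer[cid];

    uint cid_xp = cellIdxToFlatIdx(cellIdx + int3(-1, 0, 0), gridData.gridSize);
    cid_xp = isSolidCell(getXPrevMaterial(cellMaterial)) ? cid : cid_xp;
    uint cid_xn = cellIdxToFlatIdx(cellIdx + int3(1, 0, 0), gridData.gridSize);
    cid_xn = isSolidCell(getXNextMaterial(cellMaterial)) ? cid : cid_xn;
    pressure += params.coeff1.x * (pressureBuffer[cid_xp] + pressureBuffer[cid_xn]);
    // ...the y and z axes follow the same pattern as in solvePressureSystem

    pressureBuffer[cid] = pressure; // same buffer as the reads: safe, the neighbours are the other colour
}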

The shared memory and kernel fusion improvements we can leave for someone else as their recruitment task in the future, unless you find some of the fusions as trivial as I do.

Nsight profiling might be nice to do before/after the changes to see if we're winning and by how much.

P.S. The only time you should use a 1D dispatch is when you're going over the particle list, so you can map particleIx = gl_GlobalInvocationID.x
