FLIP fluid simulation example #145

Open
keptsecret wants to merge 84 commits into master

Conversation

keptsecret
Contributor

Trial from the free task list for contributing to Nabla.

Performs a real-time FLIP fluid simulation using compute shaders.

Comment on lines 46 to 72
[numthreads(WorkgroupSize, 1, 1)]
void updateNeighborFluidCells(uint32_t3 ID : SV_DispatchThreadID)
{
    uint tid = ID.x;
    int3 cIdx = flatIdxToCellIdx(tid, gridData.gridSize);

    uint thisCellMaterial = getCellMaterial(cellMaterialInBuffer[tid]);
    uint cellMaterial = 0;
    setCellMaterial(cellMaterial, thisCellMaterial);

    uint xpCm = cIdx.x == 0 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(-1, 0, 0), gridData.gridSize)]);
    setXPrevMaterial(cellMaterial, xpCm);

    uint xnCm = cIdx.x == gridData.gridSize.x - 1 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(1, 0, 0), gridData.gridSize)]);
    setXNextMaterial(cellMaterial, xnCm);

    uint ypCm = cIdx.y == 0 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, -1, 0), gridData.gridSize)]);
    setYPrevMaterial(cellMaterial, ypCm);

    uint ynCm = cIdx.y == gridData.gridSize.y - 1 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, 1, 0), gridData.gridSize)]);
    setYNextMaterial(cellMaterial, ynCm);

    uint zpCm = cIdx.z == 0 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, 0, -1), gridData.gridSize)]);
    setZPrevMaterial(cellMaterial, zpCm);

    uint znCm = cIdx.z == gridData.gridSize.z - 1 ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx + int3(0, 0, 1), gridData.gridSize)]);
    setZNextMaterial(cellMaterial, znCm);


again, 3D dispatches!

What you're doing right now is the worst way to map a 1D index to texel addresses, and it causes cache misses when sampling neighbours.

You don't benefit from the workgroup touching neighbouring cells and caching, because you're doing 128 voxels in a straight-ish line, which means you're tapping 1170 distinct cell values, a ratio of over 9.14.

If you were using a 512 workgroup then you'd tap 4626 distinct cells, giving you a ratio of 9.04.

However, if you used an 8x8x8 workgroup in a 3D dispatch, you'd only tap 1000 distinct cells, giving you a 350% reduction in cache pressure!
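To illustrate (a sketch only, not code from this PR; WORKGROUP_DIM and the bounds check are assumptions): with an 8x8x8 workgroup the dispatch thread ID can be used as the cell index directly, and the flat-index maths disappears:

// Hypothetical 3D-dispatch version, dispatched as ceil(gridSize / 8) groups per axis.
#define WORKGROUP_DIM 8

[numthreads(WORKGROUP_DIM, WORKGROUP_DIM, WORKGROUP_DIM)]
void updateNeighborFluidCells(uint32_t3 ID : SV_DispatchThreadID)
{
    // the thread ID *is* the cell index, no flatIdxToCellIdx (integer div/mod) needed
    const int3 cIdx = int3(ID);
    if (any(cIdx >= gridData.gridSize))
        return;

    uint cellMaterial = 0;
    setCellMaterial(cellMaterial, getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cIdx, gridData.gridSize)]));
    // ...the six neighbour taps stay as before, but the cells a workgroup touches now
    // form an 8x8x8 block, so neighbour reads hit data that is already warm in cache
}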


Also, you could then preload everything into a groupshared [10][10][10] tile and sample from there.
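A sketch of that preload (the sCellMaterial name and the cooperative-load loop are made up for illustration, assuming the 8x8x8 / 3D-dispatch layout above):

groupshared uint sCellMaterial[10][10][10]; // 8x8x8 block plus a 1-cell skirt

[numthreads(8, 8, 8)]
void updateNeighborFluidCells(uint32_t3 LID : SV_GroupThreadID, uint32_t3 GID : SV_GroupID)
{
    // cooperative load: 512 threads fill 1000 slots, so some threads load two cells
    const int3 tileOrigin = int3(GID) * 8 - 1;
    for (uint i = LID.x + 8 * (LID.y + 8 * LID.z); i < 1000; i += 512)
    {
        const int3 local = int3(i % 10, (i / 10) % 10, i / 100);
        const int3 cell = tileOrigin + local;
        const bool outside = any(cell < 0) || any(cell >= gridData.gridSize);
        sCellMaterial[local.z][local.y][local.x] =
            outside ? CM_SOLID : getCellMaterial(cellMaterialInBuffer[cellIdxToFlatIdx(cell, gridData.gridSize)]);
    }
    GroupMemoryBarrierWithGroupSync();

    // every neighbour tap is now a shared-memory read, e.g. the -X neighbour:
    const int3 l = int3(LID) + 1; // +1 for the skirt offset
    const uint xpCm = sCellMaterial[l.z][l.y][l.x - 1];
    // ...rest of the packing as before
}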

Comment on lines 98 to 109
for (uint pid = idx.x; pid < idx.y; pid++)
{
    Particle p = particleBuffer[pid];

    float3 weight;
    weight.x = getWeight(p.position.xyz, posvx, gridData.gridInvCellSize);
    weight.y = getWeight(p.position.xyz, posvy, gridData.gridInvCellSize);
    weight.z = getWeight(p.position.xyz, posvz, gridData.gridInvCellSize);

    totalWeight += weight;
    totalVel += weight * p.velocity.xyz;
}


this has somewhat bad load balancing; there is another way to do this: ATOMICS

Basically you'd need to clear the velocity buffer to 0 in some previous dispatch, and in updateFluidCells (since you're going through particles anyway) you'd do an atomicAdd of the weighted velocity to the cell.

Why does this work?

  • you're supposed to split the velocity buffer into mono-channel anyway (3 images, one per component)
  • emulated float atomic ops can be done via CAS loops on an R32_UINT underlying format
  • there are also extensions for atomic float and atomic float16 (would have to check if we list them in SFeatures or SLimits of the physical device) if you feel like specializing with device traits

E.g. a CAS-loop atomic add:

float32_t expectedValue;
float32_t oldValue = 0.f; // optimistic guess at the value currently in the texel
do
{
    expectedValue = oldValue;
    const float32_t newValue = oldValue + weightedVelocity;
    uint32_t oldBitpattern;
    InterlockedCompareExchange(velocity[component][cIdx], asuint(expectedValue), asuint(newValue), oldBitpattern);
    oldValue = asfloat(oldBitpattern);
} while (oldValue != expectedValue); // someone else won the race, retry against their value
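For context, the scatter side in updateFluidCells could then look roughly like this (a sketch only; atomicAddVelocity/atomicAddWeight wrap the CAS loop above and getVelocityFacePos is an assumed helper, none of these exist in the PR):

// one thread per particle
const Particle p = particleBuffer[particleIx];
const int3 cIdx = int3(floor(p.position.xyz * gridData.gridInvCellSize)); // assumes the grid origin is at 0

[unroll]
for (uint component = 0; component < 3; component++)
{
    // position of this cell's staggered face sample on the current axis (posvx/posvy/posvz in the gather version)
    const float3 facePos = getVelocityFacePos(cIdx, component, gridData);
    const float w = getWeight(p.position.xyz, facePos, gridData.gridInvCellSize);
    atomicAddVelocity(component, cIdx, w * p.velocity[component]);
    atomicAddWeight(component, cIdx, w); // normalize velocity by the accumulated weight in a later pass
    // a full P2G scatter would also add to the neighbouring face samples the kernel overlaps
}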


Then, if all the weighted accumulation happens in updateFluidCells, this dispatch only needs to enforce the boundary condition, so it can be kernel-fused with the updateNeighbor dispatch.


Yep, there should be device_traits for shaderBufferFloat16AtomicAdd, shaderBufferFloat32AtomicAdd and shaderImageFloat32AtomicAdd if you compile your code with ILogicalDevice.

[[vk::binding(2, 1)]] RWStructuredBuffer<uint> globalHistograms;
[[vk::binding(3, 1)]] RWStructuredBuffer<uint> partitionHistogram;

groupshared uint localHistogram[NumSortBins];


your local histogram was so small that you could have prefix-summed it in shared memory with a workgroup right away, just like our counting sort example.

Anyway, with a global sort over the whole image, that won't be possible anymore.
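For reference, the in-workgroup version would look roughly like this (a sketch only, assuming WorkgroupSize >= NumSortBins; a production version would use workgroup/subgroup scan primitives rather than this naive loop):

// in-place Hillis-Steele inclusive scan over localHistogram, entirely in shared memory
void prefixSumLocalHistogram(const uint tid)
{
    GroupMemoryBarrierWithGroupSync(); // make sure the histogram is fully built
    for (uint offset = 1; offset < NumSortBins; offset <<= 1)
    {
        const uint partial = (tid >= offset && tid < NumSortBins) ? localHistogram[tid - offset] : 0;
        GroupMemoryBarrierWithGroupSync(); // everyone reads before anyone writes
        if (tid < NumSortBins)
            localHistogram[tid] += partial;
        GroupMemoryBarrierWithGroupSync(); // writes visible before the next round
    }
    // localHistogram[b] now holds the number of keys in bins 0..b (inclusive)
}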

Comment on lines 22 to 28
    float3 velocity = velocityFieldBuffer[cIdx].xyz;
    velocity += float3(0, -1, 0) * gravity * deltaTime;

    enforceBoundaryCondition(velocity, cellMaterialBuffer[c_id]);

    velocityFieldBuffer[cIdx].xyz = velocity;
}


this could have been kernel-fused with updateNeighbor

Comment on lines 26 to 88
[numthreads(WorkgroupSize, 1, 1)]
void setAxisCellMaterial(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    uint cellMaterial = cellMaterialBuffer[cid];

    uint this_cm = getCellMaterial(cellMaterial);
    uint xp_cm = getXPrevMaterial(cellMaterial);
    uint yp_cm = getYPrevMaterial(cellMaterial);
    uint zp_cm = getZPrevMaterial(cellMaterial);

    uint3 cellAxisType;
    cellAxisType.x =
        isSolidCell(this_cm) || isSolidCell(xp_cm) ? CM_SOLID :
        isFluidCell(this_cm) || isFluidCell(xp_cm) ? CM_FLUID :
        CM_AIR;
    cellAxisType.y =
        isSolidCell(this_cm) || isSolidCell(yp_cm) ? CM_SOLID :
        isFluidCell(this_cm) || isFluidCell(yp_cm) ? CM_FLUID :
        CM_AIR;
    cellAxisType.z =
        isSolidCell(this_cm) || isSolidCell(zp_cm) ? CM_SOLID :
        isFluidCell(this_cm) || isFluidCell(zp_cm) ? CM_FLUID :
        CM_AIR;

    uint3 cmAxisTypes = 0;
    setCellMaterial(cmAxisTypes, cellAxisType);

    axisCellMaterialOutBuffer[cid].xyz = cmAxisTypes;
}

[numthreads(WorkgroupSize, 1, 1)]
void setNeighborAxisCellMaterial(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    uint3 axisCm = (uint3)0;
    uint3 this_axiscm = getCellMaterial(axisCellMaterialInBuffer[cid].xyz);
    setCellMaterial(axisCm, this_axiscm);

    uint3 xp_axiscm = cellIdx.x == 0 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(-1, 0, 0), gridData.gridSize)].xyz);
    setXPrevMaterial(axisCm, xp_axiscm);

    uint3 xn_axiscm = cellIdx.x == gridData.gridSize.x - 1 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(1, 0, 0), gridData.gridSize)].xyz);
    setXNextMaterial(axisCm, xn_axiscm);

    uint3 yp_axiscm = cellIdx.y == 0 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, -1, 0), gridData.gridSize)].xyz);
    setYPrevMaterial(axisCm, yp_axiscm);

    uint3 yn_axiscm = cellIdx.y == gridData.gridSize.y - 1 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, 1, 0), gridData.gridSize)].xyz);
    setYNextMaterial(axisCm, yn_axiscm);

    uint3 zp_axiscm = cellIdx.z == 0 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, 0, -1), gridData.gridSize)].xyz);
    setZPrevMaterial(axisCm, zp_axiscm);

    uint3 zn_axiscm = cellIdx.z == gridData.gridSize.z - 1 ? (uint3)CM_SOLID : getCellMaterial(axisCellMaterialInBuffer[cellIdxToFlatIdx(cellIdx + int3(0, 0, 1), gridData.gridSize)].xyz);
    setZNextMaterial(axisCm, zn_axiscm);

    axisCellMaterialOutBuffer[cid].xyz = axisCm;
}


this could be kernel fused: if you used an 8x8x8 workgroup, preloaded 10x10x10 (a 1-cell skirt), updated the axis cell materials in smem, and then did the neighbor-axis update on the inner 8x8x8 of the 10x10x10

See https://youtu.be/Ol_sHFVXvC0?si=RrP91pPpMzODMEDi&t=1585
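Structurally, that fusion could look something like this (sAxisCm and computeAxisMaterial are placeholder names; the cooperative load follows the same pattern as the skirt-preload sketch further up):

groupshared uint3 sAxisCm[10][10][10]; // unpacked per-axis materials for the 8x8x8 block + 1-cell skirt

[numthreads(8, 8, 8)]
void setAxisAndNeighborAxisCellMaterial(uint32_t3 LID : SV_GroupThreadID, uint32_t3 GID : SV_GroupID)
{
    const int3 tileOrigin = int3(GID) * 8 - 1;

    // phase 1: each thread runs the setAxisCellMaterial logic for one or two cells of the
    // 10x10x10 tile (512 threads, 1000 cells); computeAxisMaterial returns the raw per-axis
    // CM_* values (out-of-grid cells treated as CM_SOLID, as before)
    for (uint i = LID.x + 8 * (LID.y + 8 * LID.z); i < 1000; i += 512)
    {
        const int3 local = int3(i % 10, (i / 10) % 10, i / 100);
        sAxisCm[local.z][local.y][local.x] = computeAxisMaterial(tileOrigin + local);
    }
    GroupMemoryBarrierWithGroupSync();

    // phase 2: the setNeighborAxisCellMaterial gather runs only for the inner 8x8x8,
    // reading the six neighbours straight out of sAxisCm instead of global memory
    const int3 l = int3(LID) + 1;
    uint3 axisCm = (uint3)0;
    setCellMaterial(axisCm, sAxisCm[l.z][l.y][l.x]);
    setXPrevMaterial(axisCm, sAxisCm[l.z][l.y][l.x - 1]);
    setXNextMaterial(axisCm, sAxisCm[l.z][l.y][l.x + 1]);
    setYPrevMaterial(axisCm, sAxisCm[l.z][l.y - 1][l.x]);
    setYNextMaterial(axisCm, sAxisCm[l.z][l.y + 1][l.x]);
    setZPrevMaterial(axisCm, sAxisCm[l.z - 1][l.y][l.x]);
    setZNextMaterial(axisCm, sAxisCm[l.z + 1][l.y][l.x]);

    const uint cid = cellIdxToFlatIdx(tileOrigin + 1 + int3(LID), gridData.gridSize);
    axisCellMaterialOutBuffer[cid].xyz = axisCm;
}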

Comment on lines 128 to 141
    gridDiffusionOutBuffer[cid].xyz = velocity;
}

[numthreads(WorkgroupSize, 1, 1)]
void updateVelocity(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    float3 velocity = gridDiffusionInBuffer[cid].xyz;

    enforceBoundaryCondition(velocity, cellMaterialBuffer[cid]);

    velocityFieldBuffer[cellIdx].xyz = velocity;


I'm pretty sure this can be kernel fused.

Also, why do you have so many copies of velocity lying around?
I can count at least 4; do they all need to be live at the same time!?

Comment on lines 31 to 56
void calculateNegativeDivergence(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    float3 param = (float3)gridData.gridInvCellSize;
    float3 velocity = velocityFieldBuffer[cellIdx].xyz;

    float divergence = 0;
    if (isFluidCell(getCellMaterial(cellMaterialBuffer[cid])))
    {
        int3 cell_xn = cellIdx + int3(1, 0, 0);
        uint cid_xn = cellIdxToFlatIdx(cell_xn, gridData.gridSize);
        divergence += param.x * ((cell_xn.x < gridData.gridSize.x ? velocityFieldBuffer[cell_xn].x : 0.0f) - velocity.x);

        int3 cell_yn = cellIdx + int3(0, 1, 0);
        uint cid_yn = cellIdxToFlatIdx(cell_yn, gridData.gridSize);
        divergence += param.y * ((cell_yn.y < gridData.gridSize.y ? velocityFieldBuffer[cell_yn].y : 0.0f) - velocity.y);

        int3 cell_zn = cellIdx + int3(0, 0, 1);
        uint cid_zn = cellIdxToFlatIdx(cell_zn, gridData.gridSize);
        divergence += param.z * ((cell_zn.z < gridData.gridSize.z ? velocityFieldBuffer[cell_zn].z : 0.0f) - velocity.z);
    }

    divergenceBuffer[cid] = divergence;
}


this can be kernel fused with the updateVelocity and diffusion pipeline if you preload a 1-cell border around the 8x8x8 into shared memory and do some redundant work

Comment on lines 58 to 92
[numthreads(WorkgroupSize, 1, 1)]
void solvePressureSystem(uint32_t3 ID : SV_DispatchThreadID)
{
    uint cid = ID.x;
    int3 cellIdx = flatIdxToCellIdx(cid, gridData.gridSize);

    float pressure = 0;

    uint cellMaterial = cellMaterialBuffer[cid];

    if (isFluidCell(getCellMaterial(cellMaterial)))
    {
        uint cid_xp = cellIdxToFlatIdx(cellIdx + int3(-1, 0, 0), gridData.gridSize);
        cid_xp = isSolidCell(getXPrevMaterial(cellMaterial)) ? cid : cid_xp;
        uint cid_xn = cellIdxToFlatIdx(cellIdx + int3(1, 0, 0), gridData.gridSize);
        cid_xn = isSolidCell(getXNextMaterial(cellMaterial)) ? cid : cid_xn;
        pressure += params.coeff1.x * (pressureInBuffer[cid_xp] + pressureInBuffer[cid_xn]);

        uint cid_yp = cellIdxToFlatIdx(cellIdx + int3(0, -1, 0), gridData.gridSize);
        cid_yp = isSolidCell(getYPrevMaterial(cellMaterial)) ? cid : cid_yp;
        uint cid_yn = cellIdxToFlatIdx(cellIdx + int3(0, 1, 0), gridData.gridSize);
        cid_yn = isSolidCell(getYNextMaterial(cellMaterial)) ? cid : cid_yn;
        pressure += params.coeff1.y * (pressureInBuffer[cid_yp] + pressureInBuffer[cid_yn]);

        uint cid_zp = cellIdxToFlatIdx(cellIdx + int3(0, 0, -1), gridData.gridSize);
        cid_zp = isSolidCell(getZPrevMaterial(cellMaterial)) ? cid : cid_zp;
        uint cid_zn = cellIdxToFlatIdx(cellIdx + int3(0, 0, 1), gridData.gridSize);
        cid_zn = isSolidCell(getZNextMaterial(cellMaterial)) ? cid : cid_zn;
        pressure += params.coeff1.z * (pressureInBuffer[cid_zp] + pressureInBuffer[cid_zn]);

        pressure += params.coeff1.w * divergenceBuffer[cid];
    }

    pressureOutBuffer[cid] = pressure;
}


an 8x8x8 dispatch has 512 threads.

Recent Nvidia GPUs can have 2, 3 or 4 of those resident on an SM at the same time.

Vulkan requires implementations to support at least 16 KB of shared memory per workgroup, but in practice all desktop GPUs let you use 32 KB.

It just so happens that the old NV GPU arch (Pascal), which can have at most 2 workgroups co-resident, has at least 64 KB of smem, so enough to support 2 workgroups using 32 KB each, and newer GPUs have even more.

You are accessing:

  • the divergence scalar
  • the pressure scalar
  • solid testing

32 KB is enough to store 5461 cells of this data if you use float16_t, and 3276 if you use float32_t for the scalars.

That means 17^3 or 14^3 grids are possible to keep in shared memory.

Meaning you can do (17-8)/2=4 or (14-8)/2=3 iterations of pressure solving in a single dispatch.
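To sketch the idea (under the float16_t / 17^3 assumption above; loadPressureTile, relaxValidRegion and storeInnerPressure are placeholder helpers, not existing code):

groupshared float16_t sPressure[17][17][17];
groupshared float16_t sDivergence[17][17][17];
groupshared uint16_t  sMaterial[17][17][17]; // packed this-cell + neighbour solidity, ~6 bytes/cell total

[numthreads(8, 8, 8)]
void solvePressureSystem(uint32_t3 LID : SV_GroupThreadID, uint32_t3 GID : SV_GroupID)
{
    loadPressureTile(GID, LID); // cooperative load of the inner 8x8x8 plus the 4-cell skirt
    GroupMemoryBarrierWithGroupSync();

    // each Jacobi iteration invalidates one cell layer on every side of the tile,
    // so after 4 iterations the inner 8x8x8 is still exact
    [unroll]
    for (uint iter = 0; iter < 4; iter++)
    {
        // relax every still-valid cell of the tile; red-black ordering (or a second
        // smem copy) avoids read-after-write hazards within the iteration
        relaxValidRegion(LID, iter);
        GroupMemoryBarrierWithGroupSync();
    }

    storeInnerPressure(GID, LID); // only the inner 8x8x8 goes back to pressureOutBuffer
}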

@devshgraphicsprogramming
Member

devshgraphicsprogramming commented Oct 12, 2024

What I want you to do:

  1. Share Descriptor Sets and Layouts between related dispatches (especially the ones I said could be kernel fused) so there are fewer of them
  2. Don't transition your 3D images back and forth between GENERAL and READ_ONLY
  3. Make sure everything that's a "grid" is a 3D Image (one image per vector component, RGBA formats waste the alpha) in a descriptor array binding and not a Buffer
  4. Use 3D dispatches and reduce the 1D -> 3D address conversions (the ones with integer divisions and modulo especially shouldn't happen; you will never need to go cell -> particle)
  5. Use float atomics (you can require the feature) to accumulate weighted particle velocities and get rid of the particle sorts and per-grid-cell particle lists
  6. Move from SSBO and Dynamic UBO to BDA and make sure to use SoA for particles
  7. Cut down/eliminate useless velocity buffer copies (you have about 4 floating about) which don't need to be live at the same time
  8. See if you can leverage red-black ordering for some of your grid updates so you don't need an in and out buffer (especially the pressure solver); see the sketch after this list
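For point 8, a minimal sketch of red-black (checkerboard) relaxation (pushConstants.phase and the single in-place pressureBuffer are assumptions for illustration): each half-sweep only touches cells of one colour, and since all six neighbours of such a cell have the other colour, reads and writes never race and the in/out ping-pong buffers go away.

// dispatched twice per iteration: once with phase == 0 (red), once with phase == 1 (black)
[numthreads(8, 8, 8)]
void relaxPressureRedBlack(uint32_t3 ID : SV_DispatchThreadID)
{
    const int3 cellIdx = int3(ID);
    if (any(cellIdx >= gridData.gridSize))
        return;

    // checkerboard colour of this cell
    if (((cellIdx.x + cellIdx.y + cellIdx.z) & 1) != pushConstants.phase)
        return;

    const uint cid = cellIdxToFlatIdx(cellIdx, gridData.gridSize);
    const uint cellMaterial = cellMaterialBuffer[cid];
    if (!isFluidCell(getCellMaterial(cellMaterial)))
        return; // non-fluid cells can be zeroed once before the solve instead of every iteration

    float pressure = params.coeff1.w * divergenceBuffer[cid];

    uint cid_xp = cellIdxToFlatIdx(cellIdx + int3(-1, 0, 0), gridData.gridSize);
    cid_xp = isSolidCell(getXPrevMaterial(cellMaterial)) ? cid : cid_xp;
    uint cid_xn = cellIdxToFlatIdx(cellIdx + int3(1, 0, 0), gridData.gridSize);
    cid_xn = isSolidCell(getXNextMaterial(cellMaterial)) ? cid : cid_xn;
    pressure += params.coeff1.x * (pressureBuffer[cid_xp] + pressureBuffer[cid_xn]);
    // ...the y and z axes follow the same pattern as in solvePressureSystem

    pressureBuffer[cid] = pressure; // same buffer as the reads: safe, the neighbours are the other colour
}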

The shared memory and kernel fusion improvements we can leave for someone else as their recruitment task in the future, unless you find some of the fusions as trivial as I do.

Nsight profiling might be nice to do before/after the changes to see if we're winning and by how much.

P.S. The only time you should use a 1D dispatch is when you're going over the particle list, so you can map particleIx = gl_GlobalInvocationID.x
