View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0005886||The Dark Mod||Coding||public||23.01.2022 10:31||24.11.2022 21:32|
|Product Version||TDM 2.10|
|Target Version||TDM 2.11||Fixed in Version||TDM 2.11|
|Summary||0005886: Add BVH-based optimization for whole-surface processing in renderer frontend|
|Description||There are several places in the renderer frontend which process the whole model's surface every frame:|
1) R_CreateLightTris filters out only those triangles which intersect the light volume for a light-entity interaction.
Without doing that, the backend would have to render the whole model for every light touching it.
2) R_CreateShadowVolume generates a shadow volume from the whole mesh.
Sometimes there are large models (e.g. terrain) spanning a large area which are illuminated by many small lights.
This quickly becomes a performance problem.
We can build a BVH tree for large models and use it to accelerate processing.
Instead of processing the whole model every time, traverse the BVH tree: prune nodes that are fully unnecessary, accept whole nodes that are fully necessary, and subdivide questionable ones.
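The prune/accept/subdivide traversal can be sketched roughly like this (a minimal illustration, not the actual TDM structures: node layout, the box-vs-box classification, and all names are hypothetical):

```cpp
#include <cassert>
#include <utility>
#include <vector>

enum class Overlap { Outside, Inside, Partial };

// Hypothetical node: each node covers a contiguous triangle range.
struct BvhNode {
    float mins[3], maxs[3];
    int firstTri, numTris;   // contiguous range in the index buffer
    int children[2];         // -1 for leaves
};

struct Box { float mins[3], maxs[3]; };

// Classify node bounds against a light volume approximated by a box.
Overlap Classify(const BvhNode &n, const Box &light) {
    bool inside = true;
    for (int a = 0; a < 3; a++) {
        if (n.mins[a] > light.maxs[a] || n.maxs[a] < light.mins[a])
            return Overlap::Outside;
        if (n.mins[a] < light.mins[a] || n.maxs[a] > light.maxs[a])
            inside = false;
    }
    return inside ? Overlap::Inside : Overlap::Partial;
}

// Collect triangle ranges: prune Outside subtrees, accept Inside nodes
// whole, recurse into Partial nodes (Partial leaves accepted conservatively).
void Traverse(const std::vector<BvhNode> &nodes, int idx, const Box &light,
              std::vector<std::pair<int,int>> &ranges) {
    const BvhNode &n = nodes[idx];
    Overlap ov = Classify(n, light);
    if (ov == Overlap::Outside)
        return;                                        // prune whole subtree
    if (ov == Overlap::Inside || n.children[0] < 0) {
        ranges.push_back({n.firstTri, n.numTris});     // accept whole node
        return;
    }
    Traverse(nodes, n.children[0], light, ranges);     // subdivide
    Traverse(nodes, n.children[1], light, ranges);
}
```

The point is that an accepted node contributes its whole triangle range at once, without ever looking at individual triangles.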
|Steps To Reproduce||A) The issue happens on the first beta version of Moving Day by Jedi Wannabe.|
That's the starting location: there is one huge model with boulders, and many moving lights inside its bounding box.
B) The same problem happens in Black Mage in the lava cave.
Once again, the issue is most pronounced in the first beta version, since later versions have the terrain patch split into several pieces.
|Additional Information||Original discussion here:|
|Tags||No tags attached.|
Committed everything in svn rev 9990.
All the stuff is controlled by cvars starting with "r_modelBvh", although some of them only take effect before a map is started.
Now there is no performance problem if we have a large static mesh (e.g. terrain) and some dynamic lights fly near it.
The BVH is built only for static models (under cvar r_modelBvhBuild); dynamic meshes are still processed with the old code paths.
I implemented two algorithms for building the BVH, and right now the one based on clustering is used: it is slower but produces better-balanced trees.
The BVH uses individual triangles of the mesh as "elements", and the size of leaves can be controlled by cvar r_modelBvhLeafSize (more granularity = more memory).
Aside from a bounding box, every node also stores a circular cone bounding the triangle normals, which allows pruning triangles by facing.
Nodes are compressed to 12 bytes per node: the bounding box is quantized relative to its parent, and the bounding cone is quantized in absolute coordinates.
Nodes are laid out as a single array in BFS order: I believe this order is best for caching, since the first K nodes are always the topmost, i.e. the most frequently touched ones.
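The parent-relative quantization can be sketched as follows (illustrative only: the real bvhNode_t packs bbox, cone, and links into 12 bytes, while this shows just a conservative 6-byte bbox; all names are hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct QuantBox {
    uint8_t mins[3], maxs[3];    // coords in 1/255-ths of the parent's extent
};

// Quantize a child box relative to its parent box (parent must contain child).
QuantBox Quantize(const float childMins[3], const float childMaxs[3],
                  const float parMins[3], const float parMaxs[3]) {
    QuantBox q;
    for (int a = 0; a < 3; a++) {
        float ext = parMaxs[a] - parMins[a];
        // Round mins down and maxs up so the decoded box never shrinks:
        // culling must stay conservative despite the lossy compression.
        q.mins[a] = (uint8_t)std::floor((childMins[a] - parMins[a]) / ext * 255.0f);
        q.maxs[a] = (uint8_t)std::ceil ((childMaxs[a] - parMins[a]) / ext * 255.0f);
    }
    return q;
}

// Decode back to world coordinates given the parent's (already decoded) box.
void Dequantize(const QuantBox &q, const float parMins[3], const float parMaxs[3],
                float outMins[3], float outMaxs[3]) {
    for (int a = 0; a < 3; a++) {
        float ext = parMaxs[a] - parMins[a];
        outMins[a] = parMins[a] + q.mins[a] / 255.0f * ext;
        outMaxs[a] = parMins[a] + q.maxs[a] / 255.0f * ext;
    }
}
```

Since each box is decoded from its parent during traversal, the quantization error stays proportional to node size rather than to the whole model.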
The actual pruning consists of two algorithms inside idInteraction::CreateInteraction:
1) R_CreateLightTris: prunes away out-of-light ([stencil] and backfacing) triangles and returns a mesh which is rendered with the interaction shader.
2) [stencil] R_CreateShadowVolume: prunes away out-of-light and frontfacing triangles, then builds a shadow volume by extruding the remaining tris.
The first part is enabled under cvar r_modelBvhLightTris.
One major question here is: should we do precise culling of all triangles, or cull only by BVH and allow some excess triangles in the output?
The cvar allows three modes which differ in this respect.
If we prune only by BVH, then we only need to memcpy some subsegments of the mesh's index buffer into the new mesh, but we produce slightly more triangles to transfer to the GPU and render.
The big benefit is that we don't need to touch the vertex buffer at all (recall that the vertex buffer is 2.5 times larger than the index buffer in the best case!)
If we prune precisely, then we can only memcpy triangles in nodes which surely pass our filter: triangles in other nodes have to be processed individually.
When we process individual triangles, we have to touch the vertex buffer and do some computations, which takes noticeable time.
Testing in Black Mage's lava cave has shown that precise per-triangle culling yields maximum FPS, so I set it as the default.
Note that only indexes are filtered: the vertex buffer of the full mesh is attached to the output mesh without any copying.
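The BVH-only output path boils down to concatenating index-buffer subranges; a minimal sketch (names are illustrative, not the engine's):

```cpp
#include <cassert>
#include <cstring>
#include <utility>
#include <vector>

// Accepted BVH nodes map to contiguous (firstTri, numTris) intervals of the
// index buffer, so the culled mesh is assembled with plain memcpy while the
// vertex buffer is shared with the source mesh, untouched.
std::vector<int> CopyAcceptedIndexes(const std::vector<int> &srcIndexes,
                                     const std::vector<std::pair<int,int>> &triRanges) {
    int totalTris = 0;
    for (auto &r : triRanges)
        totalTris += r.second;
    std::vector<int> dst(totalTris * 3);     // 3 indexes per triangle
    int *out = dst.data();
    for (auto &r : triRanges) {
        size_t count = (size_t)r.second * 3;
        std::memcpy(out, srcIndexes.data() + (size_t)r.first * 3, count * sizeof(int));
        out += count;
    }
    return dst;
}
```

This is why the BVH-only modes are so cheap per triangle: the cost is a handful of memcpys instead of per-triangle math against the vertex buffer.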
The second part is enabled under cvar r_modelBvhShadows, and is necessary only with stencil shadows.
Right now there are several algorithms for constructing shadows:
1) "turbo" shadow finds backfacing triangles within the light volume, then extrudes them to infinity using the "w = 0" trick (R_CreateVertexProgramTurboShadowVolume).
The vertex buffer is constructed in a separate function R_CreateVertexProgramShadowCache and does not depend on the light source.
2) ordinary shadow splits the light volume of a point light into 6 separate frustums, then builds the shadow volume separately in each frustum (R_CreateShadowVolume -> R_CreateShadowVolumeInFrustum).
If a triangle is partly within the light volume, it is physically cropped. The shadow volume is extruded to the light volume boundary (instead of infinity).
3) Dmap can also pass the SG_OFFLINE flag, under which the ordinary shadow is further optimized.
However, it is disabled in dmap... at least since the D3 engine was merged, as far as I can see.
In reality, approach 1 is used almost always, with the exception that large meshes use approach 2 (when the span along X is larger than 3000 --- see idInteraction::CreateInteraction).
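The "w = 0" trick from approach 1 can be sketched as a CPU-side illustration of what the shadow vertex program conceptually computes (assuming the usual Doom 3 convention that every vertex is stored twice, as (x,y,z,1) for the near cap and (x,y,z,0) for the extruded copy; names here are hypothetical):

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };

// w=1 vertices stay in place; w=0 vertices become homogeneous points at
// infinity along the direction away from the light (v - lightOrigin, w=0).
Vec4 ExtrudeFromLight(const Vec4 &v, const float light[3]) {
    float k = 1.0f - v.w;    // 0 for cap vertices, 1 for extruded copies
    return { v.x - light[0] * k,
             v.y - light[1] * k,
             v.z - light[2] * k,
             v.w };
}
```

Because the extrusion depends only on the light origin fed in as a shader parameter, the doubled vertex buffer itself is light-independent, which is exactly why R_CreateVertexProgramShadowCache can build it once.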
I only implemented the BVH-assisted algorithm for turbo shadows.
Ordinary shadow generation looks too complicated for BVH integration.
However, since we want precisely to optimize shadow generation for large meshes, the BVH code path is used regardless of size (as long as the model is static, of course).
It probably means some increase in stencil fill rate, since we no longer reduce the size of the shadow volume, and it is only limited by the scissor test.
The BVH-assisted algorithm for shadows is somewhat more complicated than for light tris.
At the first stage, we filter triangles which are 1) within the light volume, and 2) backfacing --- these are called "shadowing".
The filtering is done coarsely by the BVH, and some BVH leaves are then processed on a per-triangle basis.
The shadowing triangles can be added to the result as the front cap, and then copied to infinity as the back cap.
The hardest part is of course the generation of side triangles: for that we need silhouette edges.
A silhouette edge is an edge which has a shadowing tri on one side and a non-shadowing tri on the other side (a missing triangle also counts as non-shadowing).
These are in fact all the boundary edges of the filtered triangle set.
In order to detect these silhouette edges, we need to store which triangles are shadowing and which are not, so that later we can check against these flags.
This is trivial to do with full-mesh processing, but with BVH we only thoroughly process some parts of the mesh --- intervals of primitives in the index buffer.
So there is a rather complicated system for storing this info per interval.
Another question is how to actually detect silhouette edges.
The full-mesh algorithm uses the list of all the mesh's edges (called "silEdges"), where each edge stores the indices of its two vertices and the two triangles on its sides.
While this is the most compact and efficient way to do it in general, I found it very hard to use with part-mesh processing and BVH (e.g. which BVH leaf should an edge between two leaves belong to?)
In the end I decided to use different adjacency information: for each triangle, I store the indices of its adjacent triangles (called "adjFaces").
With this info, it is straightforward to find silhouette edges by iterating over all shadowing triangles and checking which of their adjacent triangles are non-shadowing.
I spent much time optimizing individual pieces of the described algorithms:
1) BVH nodes are compressed to 12 bytes per node (bvhNode_t, details written above).
2) New SSE-optimized bounding box check is used during BVH traversal (idRenderMatrix::CullSixPlanes).
3) Very complicated bounding cone check, avoiding any trigonometric or other heavy instructions (bvhNode_t::HaveSameDirection).
4) New SSE-optimized code for detecting which triangles are within light volume (idSIMD::CullTrisByFrustum).
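Regarding p.3: a trigonometry-free cone check can be done in the following spirit (illustrative only, not the actual bvhNode_t::HaveSameDirection math; it assumes half-angles of at most 90 degrees, which holds for reasonably tight normal cones):

```cpp
#include <cassert>

// Circular cone of directions with precomputed cos/sin of the half-angle,
// so no trigonometric instructions are needed at query time.
struct DirCone {
    float axis[3];           // unit axis
    float cosA, sinA;        // cos/sin of the half-angle (computed once)
};

// Two cones share a common direction iff the angle between axes is at most
// the sum of half-angles:  dot(axis1, axis2) >= cos(a1 + a2).
// cos(a1 + a2) expands via the identity cos(a+b) = ca*cb - sa*sb, so the
// whole test uses only multiplications, additions, and one comparison.
bool ConesOverlap(const DirCone &c1, const DirCone &c2) {
    float dot = c1.axis[0]*c2.axis[0] + c1.axis[1]*c2.axis[1] + c1.axis[2]*c2.axis[2];
    float cosSum = c1.cosA * c2.cosA - c1.sinA * c2.sinA;   // = cos(a1 + a2)
    return dot >= cosSum;
}
```

A node whose normal cone cannot overlap the cone of light-facing directions can be pruned as entirely backfacing without touching any of its triangles.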
Speaking of FPS in real maps: there is no difference in most cases.
If I take target cases (first beta of Moving Day, or specially crafted test map), the FPS there improves dramatically.
But in most of the maps the surfaces in interactions are pretty small, and there is no noticeable difference from adding BVH.
Moreover, in some cases FPS can actually drop due to additional overhead.
The one example where BVH culling matters but does not always improve FPS is Black Mage's lava cave.
The slowdown is caused by light tris filtering (shadows always take much less time) and depends on the value of r_modelBvhLightTris.
For instance, if I use value 2 (accept some backfacing triangles) instead of 3 (precise per-triangle check), then FPS drops by 10-20 (with a typical value around 120).
UPDATE: Special test map for BVH show-off is committed in svn rev 16531 (test_5886_bvh).
It turned out that I broke shadow generation during dmap,
because dmap never uses SG_DYNAMIC mode and expects the vertex buffer to be generated explicitly (as ordinary shadow generation does).
So I have changed the rules in svn rev 9991.
Now we always use SG_DYNAMIC during gameplay (regardless of size and whether model is static/dynamic),
and we always use SG_STATIC during dmap.
Very simple =)
And the last thing: testing and debugging.
I added quite a few tests, almost all of them "stress tests", i.e. they generate many random inputs and check the outputs against a slow implementation.
* "BvhChecks:Sphere" --- tests R_CullBvhByFrustumAndOrigin on tessellated sphere, and thus tests R_BuildBvhForTri too.
* "CircCone:Simple" and "CircCone:StressCheck" --- sanity checks and stress tests on circular cones.
* "BvhChecks:HaveSameDirection" --- tests same-named method for culling backfacing surfaces based on bounding cone.
The actual code for building light tris and shadow volumes is not covered by unit tests:
* idSimd::CullTrisByFrustum (called from the two functions just above)
Maybe I should write some tests for it too, I don't know.
Of course, this code runs all the time during gameplay, so obvious errors should be detected pretty quickly.
Also, there is debug cvar r_modelBvhShow to see BVH nodes.
It takes the BVH tree, uncovers its nodes recursively up to the specified depth (e.g. value 5 means uncover up to depth 5), then renders each node's triangles with a random flat color.
So as you set the cvar to progressively higher values, you can see how BVH nodes are split into smaller subnodes.
This is useful for inspecting the quality of BVH construction.
Unfortunately, r_modelBvhShow cvar does not work with SMP on.
Moreover, since right now the backend renders the previous frame with SMP off (see 0005875), the picture makes sense only if neither you nor the objects move.
Some more related commits:
r9995. Enabled optimization in more functions in Debug With Inlines configuration.
r9996. Use idSimd::MinMax for computing bounds of precisely culled tris.
The main goal is to reduce the runtime of the new BVH functions in the Debug With Inlines configuration, although the Min/Max change should accelerate things in Release too.
Daft Mugi found a scene where BVH-based code for shadows is noticeably slower:
Of course, Frontend Acceleration makes the difference irrelevant.
In this particular case (Written in Stone at 308 -1768 123 -13.7 68.3 0), the models are lit almost completely, so BVH cannot make things much better.
On the other hand, the BVH algorithm inevitably performs more computations and has more overhead, so it gets slower.
So, here are a few general changes:
r10171 Fixed a major bug in idFlexList::SetNum!
r10173 Added idBitArray container and utility functions.
And some changes specific to this code:
r10170 Allow tweaking granularity of BVH traversal at runtime + don't use BVH for surfaces smaller than granularity.
r10172 Don't use idFlexListHuge.
r10174 Reimplemented shadow volume computation with BVH by storing all triangles in a bitset.
The last commit dramatically changes shadow-volume computation code.
While the algorithm stays the same, the data structure for storing which triangles are shadow-casting is replaced with a persistent bitset.
It allows keeping good complexity in cases like "small light with huge mesh", while reducing overhead in more typical cases.
Also, now we have a few new cvars.
They have a double effect:
1) BVH traversal does not recurse into nodes smaller than X triangles: so it returns fewer intervals, but with a larger total length.
2) surfaces with fewer than X triangles are processed using the old non-BVH code.
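The persistent-bitset idea from r10174 can be sketched like this (an illustration of the lazy-clearing pattern only, not the actual idBitArray interface; all names are hypothetical):

```cpp
#include <cassert>
#include <vector>

// The flags array is sized once per mesh and reused across lights; after
// each light we clear only the bits we actually set, so per-light cost is
// proportional to the touched triangles, not to the whole mesh.
class ShadowingSet {
public:
    void Resize(int numTris) { bits.assign(numTris, false); touched.clear(); }
    void Mark(int tri) {
        if (!bits[tri]) { bits[tri] = true; touched.push_back(tri); }
    }
    bool IsShadowing(int tri) const { return bits[tri]; }
    // O(touched) cleanup: a huge mesh lit by a small light clears quickly.
    void ClearTouched() {
        for (int t : touched) bits[t] = false;
        touched.clear();
    }
private:
    std::vector<bool> bits;
    std::vector<int> touched;
};
```

This is how "small light with huge mesh" keeps good complexity: marking, querying during silhouette detection, and clearing all scale with the light's footprint on the mesh.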
|23.01.2022 10:31||stgatilov||New Issue|
|23.01.2022 10:31||stgatilov||Status||new => assigned|
|23.01.2022 10:31||stgatilov||Assigned To||=> stgatilov|
|14.07.2022 11:01||stgatilov||Note Added: 0015010|
|14.07.2022 11:01||stgatilov||Note Added: 0015011|
|14.07.2022 11:06||stgatilov||Note Added: 0015012|
|14.07.2022 11:19||stgatilov||Note Added: 0015013|
|14.07.2022 11:43||stgatilov||Note Edited: 0015013|
|14.07.2022 11:45||stgatilov||Note Added: 0015014|
|14.07.2022 13:02||stgatilov||Note Added: 0015015|
|14.07.2022 13:02||stgatilov||Status||assigned => resolved|
|14.07.2022 13:02||stgatilov||Resolution||open => fixed|
|14.07.2022 13:02||stgatilov||Fixed in Version||=> TDM 2.11|
|14.07.2022 20:30||stgatilov||Note Added: 0015021|
|24.11.2022 21:32||stgatilov||Note Added: 0015467|