View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0003984||The Dark Mod||Graphics||public||22.12.2014 02:33||09.06.2018 18:38|
|Summary||0003984: GPU vertex skinning|
|Description||Animated (MD5) meshes and particle effects currently work out the exact position of each vertex of each mesh on the CPU, and then load the results to the GPU every frame. Both these things take time, and the result is that we can typically have only 3 or 4 AI on screen before losing FPS.|
The animation system is responsible for deciding where an AI's (or any md5 mesh's) skeleton is. Skinning is the process of fitting its visible surfaces over the current skeleton configuration. That can be done on either the CPU or the GPU. We currently use only the CPU. The best solution is to do it on *both* CPU and GPU, because doing the sums twice is less costly than passing the results between processors.
Ideally, the CPU will skin only the collision model (where needed, approximations are used most of the time) and any shadowcasting meshes, because it needs to know those results for other purposes. The GPU will skin the shadow mesh and the remaining visible meshes, which the CPU doesn't care about.
|Additional Information||The GPU has 100s or 1000s of processors that can perform this task in parallel, unlike the single CPU processor that does the task now, so letting the GPU do it will be better. Plus, the uploading of results to the GPU takes even longer than calculating them. |
Example: A single tdm_ai_citywatch seen close up with one shadowcasting light and one ambient_world light transfers 825kb of data to the GPU each frame for its visible meshes and shadow volumes.
The fast memory cache reserved on the gpu for current-frame-only meshes in TDM is only 2MB, so 3 AI or a moderate rainfall are enough to exceed it and make slower memory kick in. We could try enlarging that, but we won't need to if we get GPU skinning working.
Another problem I've spotted in researching this: 212kb of the 825kb for a city watch guard is duplicated data. Again no need to fix that if we have gpu skinning.
There is only 1 copy of the base MD5 mesh held in CPU memory for all AI using a given modelDef in a map. That base copy is used together with each AI's unique skeleton configuration to generate the individual drawn surfaces for those AI each frame. But those draw surfs are all skinned and uploaded individually to the GPU every frame.
With GPU skinning, a single copy of the modelDef will be held on each of the gpu AND the cpu, and what gets passed from the cpu to the gpu each frame for each AI is just the skeleton configuration: a matrix for each joint which totals 6kb per AI instead of 825kb.
|Tags||No tags attached.|
The changes made in Doom3 BFG for GPU Skinning:
- In Model.h: The basic surface type (triangle mesh) has had an extra pointer added: idRenderModelStatic* staticModelWithJoints. That's used for meshes whose vertex positions are determined by a set of joints.
- idMD5Mesh -- which represents a modelDef, the data shared by all AI of a given type -- has had several changes:
- - idMD5Mesh::ParseMesh builds an AI mesh in base pose (T-pose) which it loads to the GPU as a static vbo. That's the base mesh used to draw all AI of that type later. It also generates a list of matrices, one for each joint, that translate from model space into joint space.
- - idMD5Mesh::UpdateSurface chooses whether to skin a given mesh for a given frame on the CPU or GPU. It used to do CPU skinning in all cases. In BFG it can choose to produce a set of updated joint matrices instead, which will be used on the GPU to skin the mesh.
- idRenderModelMD5::InstantiateDynamicModel used to call idMD5Mesh::UpdateSurface for each mesh. In BFG it builds the skinning matrix, a matrix for each joint in its current position which will be used to move all attached vertices into position.
- tr_frontend_addmodels: This is a new file in BFG, which inherits some of the functions from (missing) tr_light.cpp, including R_AddModels, now with more variants. It does the VBO allocation for loading the current joint configuration for an AI to the GPU. It uses a single-frame vbo, so it gets replaced once per frame, and it tests whether the vbo is current for the frame.
- - RB_RenderInteractions chooses between versions of the usual shader programs for light-interacting surfaces or shadow volumes -- each program now comes in 2 flavours, for skinned or unskinned meshes.
- - RB_DrawElementsWithCounters binds the uniform buffer (current joint config) and does the rest.
- VertexCache has been upgraded to allow a third type of vbo -- the list of current joint matrices for a skinned mesh. This is a uniform buffer object, not an array of vertex data.
- The shader system is now pure GLSL (no ARB assembly shaders). That enables some of the above techniques.
Some parts of the above implementation are supported only by GLSL shaders, and we won't be able to use them unamended with our ARB shader programs.
Upgrading to allow GLSL shaders is an obvious later step for TDM engine, but there are reasons to push ahead with GPU Skinning and soft shadows first -- i.e., that upgrading to GLSL might take a very long time or not get finished.
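To make the skinning matrices in the BFG changes above more concrete, here is a minimal C++ sketch of the arithmetic, not BFG's actual code. The types `Mat3x4`, `Vec3` and `SkinnedVertex` and all function names are hypothetical, and the limit of 4 joint influences per vertex is an assumption. Each joint's skinning matrix is taken to be its current pose multiplied by the inverse of its bind pose (the model-space-to-joint-space matrix built at parse time); the per-vertex loop is the part a vertex shader would run.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// 3x4 affine transform: rotation in the left 3x3, translation in column 3.
struct Mat3x4 { float m[3][4]; };

// Compose two affine transforms; the result applies b first, then a.
Mat3x4 Multiply(const Mat3x4& a, const Mat3x4& b) {
    Mat3x4 r{};
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 4; ++j) {
            r.m[i][j] = a.m[i][0] * b.m[0][j] + a.m[i][1] * b.m[1][j] + a.m[i][2] * b.m[2][j];
        }
        r.m[i][3] += a.m[i][3];  // the implicit bottom row (0,0,0,1) carries a's translation
    }
    return r;
}

Vec3 TransformPoint(const Mat3x4& t, const Vec3& p) {
    return Vec3{
        t.m[0][0] * p.x + t.m[0][1] * p.y + t.m[0][2] * p.z + t.m[0][3],
        t.m[1][0] * p.x + t.m[1][1] * p.y + t.m[1][2] * p.z + t.m[1][3],
        t.m[2][0] * p.x + t.m[2][1] * p.y + t.m[2][2] * p.z + t.m[2][3]};
}

// Hypothetical vertex layout: bind-pose position plus up to 4 joint influences.
struct SkinnedVertex {
    Vec3 bindPosition;             // model space, base pose
    std::array<int, 4> joints;     // indexes into the skinning matrix list
    std::array<float, 4> weights;  // expected to sum to 1
};

// Once per frame per AI (CPU side): one skinning matrix per joint.
// Assumes both lists have the same length.
std::vector<Mat3x4> BuildSkinningMatrices(const std::vector<Mat3x4>& currentPose,
                                          const std::vector<Mat3x4>& inverseBindPose) {
    std::vector<Mat3x4> skin(currentPose.size());
    for (std::size_t i = 0; i < currentPose.size(); ++i) {
        skin[i] = Multiply(currentPose[i], inverseBindPose[i]);
    }
    return skin;
}

// Once per vertex (the work that would move to the GPU): weighted blend.
Vec3 SkinVertex(const SkinnedVertex& v, const std::vector<Mat3x4>& skin) {
    Vec3 out{0.0f, 0.0f, 0.0f};
    for (std::size_t k = 0; k < v.joints.size(); ++k) {
        const Vec3 p = TransformPoint(skin[v.joints[k]], v.bindPosition);
        out.x += v.weights[k] * p.x;
        out.y += v.weights[k] * p.y;
        out.z += v.weights[k] * p.z;
    }
    return out;
}
```

Only the output of `BuildSkinningMatrices` would need to cross to the GPU each frame; the `SkinnedVertex` data stays in the static vbo.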
The BFG implementation is still a close fit for our engine, so we'd use much of it, but we'd need some changes:
- VertexCache -- we can leave this unamended. We can't use the new uniform buffer objects for storing joint matrices until we have GLSL.
- Model_MD5 changes: we'd want these, namely the routine to build the initial bind pose and the changes to InstantiateDynamicModel, which builds the skinning matrices each frame. But we'll need to break the skinning matrices into a more compact representation -- a quaternion plus a vector offset (7 floats) instead of a matrix (minimum 12 floats).
- tr_frontend_addmodels: we won't use this change. We can't (yet) use uniform buffer objects. We will have to load the skinning matrices during the backend draw, using the environmental and local shader program variables.
The big thing is we can't pass as much data to the gpu per AI as BFG's GLSL implementation does. It uses a set of 102 4x4 matrices (=408 "registers") to hold the current position and orientation for each joint -- so the joint limit for any AI is 102 (not counting joints that are not used for skinning).
We will have to deal with tighter limits. My GPU allows 4096 registers to be passed as uniforms to GLSL programs in a single draw call, but only 256 registers for ARB programs. And the spec for ARB says that cards have to support only 96 registers, so we should try to stay within that limit.
We can get a second set of 96 registers by using both environmental AND local parameters, giving 192 to play with.
The engine already reserves 21 environmental registers (numbers 0-20) for drawing light interactions, and 4 local registers for the "VertexParm" parameters that people can specify in shader programs.
So we have range 21-95 of each set clear to play with. That's a maximum of 75 joints, if we put each joint's orientation as a quaternion in one list and its position as a vector in the second list. Each of those can be packed into a single register.
TDM's most complex AI models use 68 joints for skinning, so it's enough. Any meshes that wanted to use more joints in future would have to continue to be skinned on the CPU, at least till we implement GLSL.
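The register budget above can be written down as a small compile-time check. This is only a sketch: the constant and function names are hypothetical, and it assumes one environment register and one local register per joint, taken from slots 21 to 95 inclusive of each set (75 slots).

```cpp
// Register count per parameter set that the ARB spec is said above to guarantee.
constexpr int kGuaranteedRegistersPerSet = 96;

// Registers 0-20 of the environment set are already used for light interactions,
// so the first slot free in both sets is 21.
constexpr int kFirstFreeRegister = 21;

// One register per joint in each set (quaternion in one, position in the other).
constexpr int kMaxGpuSkinnedJoints = kGuaranteedRegistersPerSet - kFirstFreeRegister;

enum class SkinningPath { Cpu, Gpu };

// Meshes with too many skinning joints stay on the CPU path.
constexpr SkinningPath ChooseSkinningPath(int skinningJointCount) {
    return skinningJointCount <= kMaxGpuSkinnedJoints ? SkinningPath::Gpu
                                                      : SkinningPath::Cpu;
}
```

With these numbers a 68-joint AI would take the GPU path, while a mesh using BFG's full 102 joints would fall back to CPU skinning.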
BFG packs the vertex->joint mappings and weights into the original mesh's "color" settings. I'll have to check whether we use vertex coloring for any light interactions. If we do, then we can't do that. Ideally we'd find somewhere else to put them so we maintain maximum flexibility.
Fortunately, there's not much code that creates VBOs, and even less that's used by MD5 meshes. It all sits in idVertexCache, ::Alloc() and ::AllocFrameTemp(). This is a clean design and it shouldn't be too hard to work with.
::AllocFrameTemp() deals with the memory meant to be used for single-frame meshes. Two buffers of 2MB each are maintained, so that one can be written to while the other is being read from ("double-buffered"). In theory, anyway: we don't need it while we are using a single-threaded architecture, and it isn't used much, only by gui drawing, weather patches, skybox texgens, and some decals.
::Alloc() is the one used by md5 meshes (as well as the rest of the world's solid objects).
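For readers unfamiliar with the double-buffering idea described for ::AllocFrameTemp(), a minimal sketch follows. It is not idVertexCache's code: the class `FrameTempAllocator` and its methods are hypothetical, and it uses plain system memory in place of real vbos.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Two fixed-size buffers: one is filled this frame while the other,
// filled last frame, can still be read from.
class FrameTempAllocator {
public:
    explicit FrameTempAllocator(std::size_t bytesPerBuffer) {
        for (auto& b : buffers_) {
            b.resize(bytesPerBuffer);
        }
    }

    // Bump-allocate from the buffer being written this frame.
    // Returns nullptr when the request does not fit, so the caller
    // can fall back to a slower allocation path.
    std::uint8_t* Alloc(std::size_t bytes) {
        std::vector<std::uint8_t>& buf = buffers_[writeIndex_];
        if (bytes > buf.size() - used_) {
            return nullptr;
        }
        std::uint8_t* p = buf.data() + used_;
        used_ += bytes;
        return p;
    }

    // Swap roles: the next frame writes into the other buffer from offset 0.
    void EndFrame() {
        writeIndex_ = 1 - writeIndex_;
        used_ = 0;
    }

    std::size_t BytesUsed() const { return used_; }

private:
    std::array<std::vector<std::uint8_t>, 2> buffers_;
    std::size_t writeIndex_ = 0;
    std::size_t used_ = 0;
};
```

With a 2MB buffer, two requests of 825kb fit and a third does not, which is the kind of overflow the original report describes.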
Only 2 functions pass lists of mesh vertices to the GPU (they use "Alloc"):
- R_CreateAmbientCache -- which does visible surfaces
- R_CreateVertexProgramShadowCache -- which does shadow volumes
For a static model, this only needs doing once at map start. For md5 meshes, it's done every frame while they are in view.
Those functions pass the entire mesh to the GPU. Individual lights will hit subsections of those meshes.
Only 2 functions pass indexes to the GPU. The indexes are what specifies subsections of the meshes loaded above, for individual light interactions:
- idInteraction::AddActiveInteraction -- which deals with lit materials and shadow volumes
- R_AddAmbientDrawsurfs -- which deals with unlit materials (or unlit stages of a material decl)
A given light will probably hit less than half the tris belonging to an AI (because the other half will be facing away). So the interaction between that light and the AI mesh will need a subset of the vertices that got uploaded by R_CreateAmbientCache above. That subset is identified by an index, and the index needs calculating and uploading every frame, as the subset of tris changes frame by frame.
Likewise the shadowcasting silhouette changes frame by frame. That gets uploaded as an index to the shadow volume verts too.
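The per-light index described above amounts to a facing test on every triangle. The sketch below shows one way to build such a subset on the CPU; it is not the engine's code, the function name is hypothetical, and it assumes counter-clockwise winding means front-facing, which may not match the engine's convention.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

Vec3 Sub(const Vec3& a, const Vec3& b) { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }

Vec3 Cross(const Vec3& a, const Vec3& b) {
    return Vec3{a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x};
}

float Dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns the indexes of the triangles whose front side faces the light.
// verts must already be skinned: the test needs current vertex positions.
std::vector<int> BuildLightFacingIndexes(const std::vector<Vec3>& verts,
                                         const std::vector<int>& indexes,
                                         const Vec3& lightOrigin) {
    std::vector<int> facing;
    for (std::size_t i = 0; i + 2 < indexes.size(); i += 3) {
        const Vec3& a = verts[indexes[i]];
        const Vec3& b = verts[indexes[i + 1]];
        const Vec3& c = verts[indexes[i + 2]];
        const Vec3 normal = Cross(Sub(b, a), Sub(c, a));
        if (Dot(normal, Sub(lightOrigin, a)) > 0.0f) {
            facing.push_back(indexes[i]);
            facing.push_back(indexes[i + 1]);
            facing.push_back(indexes[i + 2]);
        }
    }
    return facing;
}
```

Because the result depends on both the light position and the current vertex positions, it has to be rebuilt whenever either changes.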
Typical figures in bytes for the allocations in 1 frame, for one AI hit by 1 shadowcasting light and one ambient:
R_CreateAmbientCache 273120 (Verts)
R_CreateVertexProgramShadowCache 74176 (Verts)
idInteraction::AddActiveInteraction 190560 (Index)
R_AddAmbientDrawsurfs 66312 (Index)
The aim of GPU skinning is to save the Verts allocations, or just over 57% of the total.
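The percentage follows directly from the four figures listed above. A small sketch of the arithmetic, with hypothetical constant names:

```cpp
// Per-frame allocations in bytes, copied from the figures above.
constexpr long kAmbientCacheVerts = 273120;   // R_CreateAmbientCache
constexpr long kShadowCacheVerts = 74176;     // R_CreateVertexProgramShadowCache
constexpr long kInteractionIndexes = 190560;  // idInteraction::AddActiveInteraction
constexpr long kAmbientIndexes = 66312;       // R_AddAmbientDrawsurfs

constexpr long kVertBytes = kAmbientCacheVerts + kShadowCacheVerts;   // 347296
constexpr long kIndexBytes = kInteractionIndexes + kAmbientIndexes;   // 256872
constexpr long kTotalBytes = kVertBytes + kIndexBytes;                // 604168

// Fraction of the per-frame upload that GPU skinning would remove (about 0.575).
constexpr double VertShare() {
    return static_cast<double>(kVertBytes) / static_cast<double>(kTotalBytes);
}
```

The remaining share, roughly 42.5%, is index data that GPU skinning on its own would not remove.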
We might be able to cut down on indexes too. The duplicate data mentioned in the original report is duplicate index data coming from idInteraction::AddActiveInteraction, but that's probably a separate task.
Also, I need to check the BFG code again to see how it handles the indexes for interactions. Maybe (hopefully) it's doing something different. To be able to calculate the indexes on the CPU as is done now, it would need to skin the visible meshes on the CPU too, else it wouldn't know which tris face the light and which don't.
Moving objects like func_rotators and func_movers don't upload new vertex data every frame. The verts are static because they are measured in relation to the model's origin and axis, and those measurements don't change when the model moves. It's just the model origin and axis that change. Instead of uploading new verts, they just create new indexes for light interactions, to say which bits of the model are hit by lights.
Likewise a moving light doesn't cause static models to re-upload their vertex data, but again the indexes are invalidated and get recreated and re-uploaded.
For static models where the lights are not moving, none of the above gets recreated or re-uploaded. Both vertex and index vbos are re-used frame to frame.
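The reason movers never re-upload verts can be seen in a few lines. In this sketch (hypothetical names, with the axis stored as three basis vectors) the local vertex data never changes; only the placement passed alongside it does.

```cpp
struct Vec3 { float x, y, z; };

// Where a model sits in the world: an origin plus three basis vectors.
struct ModelPlacement {
    Vec3 origin;
    Vec3 axis[3];
};

// World position = origin + x*axis[0] + y*axis[1] + z*axis[2].
// The local vertex is constant; only the placement changes as the model moves.
Vec3 LocalToWorld(const ModelPlacement& m, const Vec3& local) {
    return Vec3{
        m.origin.x + local.x * m.axis[0].x + local.y * m.axis[1].x + local.z * m.axis[2].x,
        m.origin.y + local.x * m.axis[0].y + local.y * m.axis[1].y + local.z * m.axis[2].y,
        m.origin.z + local.x * m.axis[0].z + local.y * m.axis[1].z + local.z * m.axis[2].z};
}
```

Moving or rotating the model changes only the small `ModelPlacement`, so the vertex vbo uploaded at map start can be reused every frame.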
BFG only did half the job with vertex skinning, and we shouldn't stop where it left off. It saves about half the traffic between the CPU and GPU (the vert locations for animated meshes), but it still relies on the CPU calculating and uploading the other half each frame: the indexes (lists) of triangles that are hit by a light each frame, and the shadowcasting silhouette of each model.
BFG is still an OpenGL2 renderer. It uses GLSL but it doesn't take advantage of any of the OpenGL3 facilities that could cut out the rest of the traffic. We could try using a geometry shader to discard unlit triangles, for example, and to work out the shadow silhouette.
Doing this work on the GPU instead of the CPU is essential because we want to use Instancing for drawing multiple copies of a model in one draw call, which means we'll want to use the same lists of tris for every copy of a model. No more having the CPU work out what tris get hit by a light, which makes it maintain and upload a different Index for each copy of a model.
One example that we do want to follow from BFG: it adopts a "stateless" system where none of this info is cached between frames, or even between draw passes in a single frame. It's all re-calculated every time it needs to be used. That's a good thing both for code simplicity and for performance. Since the engine was designed, number-crunching has become relatively much cheaper than storing and retrieving blocks of data.
Code freeze: 10/4/2017
Possibility of completion in 2.06? 0%
Moving to 2.07
|22.12.2014 02:33||SteveL||New Issue|
|22.12.2014 02:38||SteveL||Assigned To||=> SteveL|
|22.12.2014 02:38||SteveL||Status||new => assigned|
|22.12.2014 02:38||SteveL||Relationship added||child of 0003684|
|22.12.2014 03:06||SteveL||Description Updated||View Revisions|
|01.01.2015 20:51||SteveL||Note Added: 0007291|
|01.01.2015 20:54||SteveL||Note Edited: 0007291||View Revisions|
|01.01.2015 21:11||SteveL||Note Added: 0007292|
|01.01.2015 21:14||SteveL||Note Edited: 0007292||View Revisions|
|01.01.2015 21:19||SteveL||Note Edited: 0007292||View Revisions|
|03.01.2015 12:38||SteveL||Note Added: 0007296|
|03.01.2015 12:41||SteveL||Note Added: 0007297|
|03.01.2015 12:41||SteveL||Note Edited: 0007297||View Revisions|
|03.01.2015 12:46||SteveL||Note Edited: 0007296||View Revisions|
|03.01.2015 12:50||SteveL||Note Edited: 0007297||View Revisions|
|03.01.2015 16:05||SteveL||Note Edited: 0007296||View Revisions|
|14.02.2015 16:04||SteveL||Note Added: 0007420|
|14.02.2015 16:05||SteveL||Note Edited: 0007420||View Revisions|
|30.12.2015 15:28||SteveL||Target Version||TDM 2.04 => TDM 2.05|
|22.11.2016 20:24||nbohr1more||Product Version||=> SVN|
|22.11.2016 20:24||nbohr1more||Target Version||TDM 2.05 => TDM 2.06|
|15.02.2017 04:36||grayman||Assigned To||SteveL =>|
|15.02.2017 04:36||grayman||Status||assigned => new|
|17.09.2017 18:57||nbohr1more||Note Added: 0009274|
|17.09.2017 18:58||nbohr1more||Target Version||TDM 2.06 => TDM 2.07|
|09.06.2018 18:38||nbohr1more||Target Version||TDM 2.07 =>|