Context
I am optimizing hierarchical transformation updates in a scene graph. My goal is to minimize cache misses during the parent-child transform propagation.
A common object-oriented scene graph suffers from lots of indirect accesses, so I tried a topologically sorted (i.e. the nodes are sorted by non decreasing ParentID) vector representation:
struct Mat4x4 {
float e00, e01, e02, e03;
float e10, e11, e12, e13;
float e20, e21, e22, e23;
float e30, e31, e32, e33;
};
struct Node {
Mat4x4 Transform;
int ParentID;
};
std::vector<Node> nodes;
The topological sort of the nodes vector is necessary to ensure correct order of propagation, i.e. not updating a children transform before its parent; the update process propagates transforms linearly:
for (size_t i = 0; i < nodes.size(); ++i) {
if (nodes[i].ParentID >= 0) { // the node has a parent
const Node &parent = nodes[nodes[i].ParentID];
nodes[i].Transform = ApplyTransform(parent.Transform, nodes[i].Transform);
}
}
The hot point within this loop seems to be the access to the parent node, and not the actual ApplyTransform
computations.
This observation comes from running a simple benchmark of the update function on a scene containing ~2000 characters, each represented as a tree of ~50 elements (a total of ~100000 nodes).
In that case perf stat
gives
stalled-cycles-backend:u # 48.04% backend cycles idle
and by looking at recorded profiling data, most of the time is spent on movups
and movaps
operations, indicating frequent cache misses when loading the data for the computations.
The benchmark code was compiled using g++ -O3
, and is run on a single core.
Question
I was wondering: How can I improve the memory layouts or traversal strategies for better cache efficiency for hierarchical transforms? Are there any specific techniques or patterns used in large scale real-time rendering for solving this problem?
I am not looking for open-ended discussions on scene graphs but rather concrete data structures or patterns that improve memory locality.