Volume 04 · Issue 11 · April 2026

An Engineering Journal of the MiRiS Game Engine.

Long-form technical writing on architecture, rendering, memory, and the hard problems of building a real-time engine. Published irregularly. Read at your pace.

Editor: R. Vellichor
Established: Reykjavik, 2022
Cadence: Quarterly, more or less
ISSN: 2767-0044

A small theory of frame budgets

On treating the 16.6 ms frame as a contract between subsystems, rather than a goal we hope to meet.

A frame is not a duration; it is a budget. The distinction matters because budgets are negotiable, and durations are not. When we treat 16.6 ms as a goal, every subsystem feels licensed to be late. When we treat it as a contract, lateness becomes a defect with an owner.

This piece collects four years of frame-pacing notes from MiRiS. None of it is novel; all of it is hard-won. We start with the simplest model and add nuance only where the engine forced us to.

1.1 · The naive model

The naive model says: a frame is a function from Stateₙ to Stateₙ₊₁, and rendering is a side-effect. We measure the function. If it averages under 16.6 ms, we ship.
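
In code, the naive model is a loop of roughly this shape. A minimal sketch with illustrative names, not MiRiS API:

cpp naive_loop_sketch.cpp
#include <chrono>
#include <cstdint>

struct State { /* the whole simulation */ };

State Simulate(const State& s, double dt);  // assumed to exist elsewhere
void  Render(const State& s);               // assumed to exist elsewhere

// The naive model: advance the state, render as a side effect,
// and judge the frame by its average wall-clock time.
double NaiveFrameLoop(State state, const bool& running) {
  using Clock = std::chrono::steady_clock;
  double totalMs = 0.0;
  uint64_t frames = 0;
  while (running) {
    const auto t0 = Clock::now();
    state = Simulate(state, 1.0 / 60.0);   // State_n -> State_n+1
    Render(state);                         // the "side-effect"
    totalMs += std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
    ++frames;
  }
  return frames ? totalMs / frames : 0.0;  // ship if this is under 16.6; that is the trap
}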

This model is wrong in three ways. It hides variance behind averages. It conflates the simulation step with the render step. And it ignores the GPU as an asynchronous coprocessor with its own budget.

1.2 · Variance is the product

Players do not feel an average. They feel the worst frame in a hundred. A 12 ms average with a 28 ms 99th-percentile is, in lived experience, a 28 ms game. Optimization that lowers the mean while raising the tail is a regression dressed as a win.

We track three numbers per subsystem, every frame: median, 95th, and 99th. The 99th is the only one that goes in the budget table. It is a harsh standard; it is the right one.

cpp budget_table.cpp
// 16.6 ms total. Numbers are p99, not mean.
constexpr FrameBudget kBudget = {
  .input        = 0.4,   // ms
  .gameplay     = 3.2,
  .animation    = 2.1,
  .physics      = 2.8,
  .culling      = 1.0,
  .render_setup = 1.7,
  .audio_mix    = 0.6,
  .slack        = 4.8,   // for the GPU and the unknown
};
static_assert(kBudget.sum() <= 16.6, "frame overcommitted");
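
The FrameBudget type itself is not shown above. A minimal sketch of a shape that would satisfy both the table and the static_assert; the field names come from the table, while the sum() helper is an assumption rather than a quote from MiRiS source:

cpp frame_budget_sketch.hpp
// One p99 allowance per subsystem, in milliseconds.
struct FrameBudget {
  double input, gameplay, animation, physics;
  double culling, render_setup, audio_mix, slack;

  // constexpr so the overcommit check runs at compile time.
  constexpr double sum() const {
    return input + gameplay + animation + physics +
           culling + render_setup + audio_mix + slack;
  }
};
// Note: the table sums to exactly the 16.6 ms cap, so the comparison sits
// right on a floating-point boundary; storing the budget in integer
// microseconds is one way to make the assert exact.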

1.3 · The slack line

The last entry in that table — slack — is the most important one. It is the budget for everything we did not foresee: a level that loads a few more lights, a designer who adds one more particle system, a driver that hiccups for reasons unknown. A frame without slack is a frame already broken.

The cheapest optimization is the one we never have to make, because slack absorbed the surprise.

1.4 · Subsystems own their numbers

A budget is a contract; a contract needs a counterparty. Each subsystem in MiRiS has a single human owner, named in the source, responsible for the p99 of their slice. When the number drifts, the owner is paged — not the build engineer, not the producer, not the on-call generalist.

This sounds bureaucratic. It is the opposite. Bureaucracy is a budget that nobody owns, drifting upward by a tenth of a millisecond per sprint, until one Tuesday the game stutters and a five-person retrospective discovers that everyone contributed and nobody is responsible.
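
What "named in the source" can look like is nothing fancier than a table next to the budget. A sketch of the idea; the struct shape and the owner handles are illustrative, not MiRiS source:

cpp budget_owner_sketch.cpp
#include <cstdio>

// Each slice carries its p99 allowance and the human who answers for it.
struct BudgetSlice {
  const char* name;
  double      p99Ms;    // the contract
  const char* owner;    // who gets paged when the number drifts
};

constexpr BudgetSlice kSlices[] = {
  {"gameplay", 3.2, "owner.handle.a"},   // placeholder handles
  {"physics",  2.8, "owner.handle.b"},
  // ...the rest of the table...
};

// Called from the nightly rollup: page the owner, not the build engineer.
void CheckSlice(const BudgetSlice& s, double measuredP99Ms) {
  if (measuredP99Ms > s.p99Ms) {
    std::printf("budget drift: %s at %.2f ms (budget %.2f, owner %s)\n",
                s.name, measuredP99Ms, s.p99Ms, s.owner);
  }
}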

Footnote [1] at the foot of this article expands on the paging policy and the on-call rotation, which we revised twice before it stopped causing resentment.

1.5 · The GPU is a separate country

Everything above is the CPU's frame. The GPU has a frame too, and it speaks a different language. Submitting work in 4 ms of CPU time is meaningless if the GPU then takes 22 ms to drain it. We keep two budget tables, one per device, and a third that tracks the queue between them. The queue is where deadlocks hide.

We will return to the GPU side in §2. For now: trust nothing that is not measured on the device that does the work.
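
What "measured on the device that does the work" means in practice: a pair of timestamps written into the command stream and read back once the frame's work has drained. A sketch assuming a Vulkan backend purely for illustration; the article does not say which API MiRiS targets, and the query pool, command buffer, and fence handling are assumed to exist:

cpp gpu_frame_timer_sketch.cpp
#include <vulkan/vulkan.h>

// Bracket the frame's GPU work with two timestamp writes.
void WriteFrameTimestamps(VkCommandBuffer cmd, VkQueryPool pool) {
  vkCmdResetQueryPool(cmd, pool, /*firstQuery=*/0, /*queryCount=*/2);
  vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, pool, 0);
  // ...record the frame's draws and dispatches here...
  vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, pool, 1);
}

// Read back after the frame's fence has signalled. timestampPeriodNs comes
// from VkPhysicalDeviceLimits::timestampPeriod. Returns GPU time in ms.
double ReadGpuFrameMs(VkDevice device, VkQueryPool pool, double timestampPeriodNs) {
  uint64_t ticks[2] = {};
  vkGetQueryPoolResults(device, pool, 0, 2, sizeof(ticks), ticks,
                        sizeof(uint64_t), VK_QUERY_RESULT_64_BIT);
  return double(ticks[1] - ticks[0]) * timestampPeriodNs * 1e-6;
}

The number this returns belongs in the GPU's table, not the CPU's.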


[1] The on-call policy is documented in the engineering handbook, chapter 7. The short form: an owner is paged once per regression, never twice in a week, and never on the weekend unless the regression is shipping-blocking. The point is to make the contract real, not to make engineers miserable.

Clustered forward, six years on

A retrospective on the rendering path that survived three GPU generations and one rewrite that we abandoned at 60% complete.

In 2020 we chose a clustered forward renderer for MiRiS, against the recommendation of two consultants and the prevailing fashion of the moment. It was the right call. It is also the call we have second-guessed at least once a year since.

2.1 · Why clustered, why forward

The choice was driven by four constraints: many small lights, transparent surfaces with correct shading, a wide range of target hardware, and a small graphics team. Deferred rendering solves the first; it complicates the second; it widens the gulf with low-end hardware; and it asks more of a small team than we could give.

Clustered forward gave us per-pixel light culling without a G-buffer, transparency that reads the same code path as opaque, and a memory profile that scaled cleanly from a Steam Deck to a high-end desktop. The cost was a more complex tile/cluster build step, which we now think is the easiest part of the pipeline to maintain.

2.2 · The cluster grid

Our cluster grid is 16 × 9 × 24: sixteen tiles wide, nine tall, twenty-four slices deep. The depth slices are exponential, denser near the camera, because lights cluster near the things players look at.

hlsl cluster_assign.hlsl
float SliceFromZ(float viewZ)
{
    // Exponential slicing. kNear/kFar from the camera; kSlices = 24.
    const float ratio = kFar / kNear;
    return floor(log(viewZ / kNear) / log(ratio) * kSlices);
}

uint3 ClusterIndex(float4 svPos, float viewZ)
{
    uint2 tile  = uint2(svPos.xy) / kTileSize;
    uint  slice = (uint) SliceFromZ(viewZ);
    return uint3(tile, slice);
}
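
The inverse mapping, from slice index back to a view-space depth range, is what a per-camera cluster-bounds build needs. A C++ rendition under the same kNear/kFar/kSlices scheme; the constant values below are illustrative, not MiRiS configuration:

cpp cluster_slice_bounds_sketch.cpp
#include <cmath>

constexpr float kNear   = 0.1f;    // illustrative values
constexpr float kFar    = 500.0f;
constexpr int   kSlices = 24;

// Exponential slicing: z(s) = kNear * (kFar / kNear)^(s / kSlices),
// the inverse of SliceFromZ above.
struct SliceDepths { float zNear, zFar; };

SliceDepths SliceBounds(int s) {
  const float ratio = kFar / kNear;
  return {
    kNear * std::pow(ratio, float(s)     / kSlices),
    kNear * std::pow(ratio, float(s + 1) / kSlices),
  };
}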

2.3 · What we got wrong, twice

We got the cluster build wrong twice. The first version assigned lights to clusters on the CPU and uploaded a fresh buffer every frame. This worked for two hundred lights and fell over at a thousand. The second version moved the build to a compute shader but used a naive sphere-versus-frustum test that produced 40% false positives. Useful frames, useless lights.

The third version, current, uses a tighter sphere/AABB intersection in cluster space and a small precomputed table for the slice planes. The false-positive rate is around 6%, which we have decided to live with.

  1. Build the cluster bounds once per camera, not per frame.
  2. Test light spheres in cluster space, not view space; the math is shorter and the false-positive rate is lower. (A sketch of the test follows this list.)
  3. Use a fixed-width per-cluster light bitfield, not a list. Variable-length lists fragment the cache.
  4. Always overflow gracefully. We log when a cluster exceeds 64 lights; we do not crash.
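
The cluster-space test mentioned in point 2 is short enough to show whole: clamp the light's center to the cluster's box and compare squared distances. A sketch assuming a plain float3/AABB pair, not the actual cluster-build shader:

cpp sphere_vs_cluster_sketch.cpp
#include <algorithm>

struct float3 { float x, y, z; };
struct AABB   { float3 lo, hi; };   // cluster bounds, built once per camera

// True if the light sphere touches the cluster's box: clamp the center to
// the box, then compare the squared distance against the squared radius.
bool SphereIntersectsCluster(const float3& c, float radius, const AABB& box) {
  const float px = std::clamp(c.x, box.lo.x, box.hi.x);
  const float py = std::clamp(c.y, box.lo.y, box.hi.y);
  const float pz = std::clamp(c.z, box.lo.z, box.hi.z);
  const float dx = c.x - px, dy = c.y - py, dz = c.z - pz;
  return dx * dx + dy * dy + dz * dz <= radius * radius;
}

The residual false positives come largely from the AABB being a loose fit for a frustum-shaped cluster, which is presumably the 6% the article has decided to live with.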

2.4 · The rewrite that we abandoned

In 2024 we started a visibility-buffer renderer. The plan was: render IDs first, shade in screen space, free ourselves from the cluster machinery. We got it 60% complete, including a working production prototype, and then we stopped.

The reason we stopped is unromantic. The visibility-buffer path won on the GPUs we already shipped well on, and lost on the GPUs where we needed the wins. The Steam Deck especially — a hardware target we cannot ignore — preferred the clustered path by a clear margin in our test scenes.[2] We archived the branch and went back to improving what worked.

A renderer that is faster on hardware you already ship well on is not faster. A renderer that is faster on hardware you struggle with is the only renderer that matters.

2.5 · Where we are going

The roadmap for clustered forward in MiRiS is short: better depth pre-pass utilization, a tighter shadow-cascade selection per cluster, and an option to drop transparency to a separate forward pass when the artists really, really need fifty alpha-blended particles overlapping.

None of this is glamorous. It is, however, the work that pays the rent.


[2] Source for the scene benchmarks referenced above lives in tools/bench/scenes/. The light-build microbenchmark is in tools/bench/cluster_build_bench.cpp. We are aware that our reference scene is unrepresentatively dense; we keep it that way on purpose.

A profiler that respects the timeline

Why we wrote our own in-engine profiler, what we copied, what we discarded, and the one feature we still regret.

There are many good profilers. None of them think about the engine the way the engine thinks about itself, and that mismatch is, eventually, the limiting factor in how much you can debug.

3.1 · The thing we wanted

We wanted three things from a profiler: a per-frame timeline that survives a hitched frame, named regions that nest cleanly, and an output format that we could diff in CI. Out-of-the-box tools gave us two of the three; the diff-in-CI part was always missing, or required so much glue that we might as well have written it ourselves.

So we did. The profiler is two files plus a viewer. It is small enough that we read all of it during onboarding, every time, with every new engineer.
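
The diff-in-CI piece is the part worth sketching, because it is why the profiler exists at all: compare per-region p99 against a baseline dump and fail the build on drift. A sketch of the idea; the map shape, threshold, and function name are assumptions, not the MiRiS format:

cpp profile_diff_sketch.cpp
#include <cstdio>
#include <map>
#include <string>

using RegionP99 = std::map<std::string, double>;  // region name -> p99 ms

// Returns the number of regions whose p99 drifted past the tolerance.
int DiffProfiles(const RegionP99& baseline, const RegionP99& current,
                 double toleranceMs = 0.2) {
  int regressions = 0;
  for (const auto& [name, base] : baseline) {
    const auto it = current.find(name);
    if (it == current.end()) continue;   // region removed; not a regression
    if (it->second > base + toleranceMs) {
      std::printf("p99 regression: %s  %.2f ms -> %.2f ms\n",
                  name.c_str(), base, it->second);
      ++regressions;
    }
  }
  return regressions;   // CI fails the build when this is non-zero
}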

3.2 · The data model

Each profiler region is a stack push and pop. We store: thread id, depth, name pointer, start tick, end tick. We do not store anything else. Five fields, thirty-two bytes per region.

We allocate region records out of a per-thread ring buffer that holds three frames. Three is enough to inspect a hitched frame and the two around it; more is bookkeeping nobody reads.

cpp profiler.hpp
struct Region {
  uint64_t startTick;
  uint64_t endTick;
  const char* name;     // pointer into .rodata, never freed
  uint16_t threadId;
  uint16_t depth;
};
static_assert(sizeof(Region) == 32, "region grew, audit callers");

class ScopedRegion {
  Region* r_;
 public:
  explicit ScopedRegion(const char* name) {
    r_ = PushRegion(name);
  }
  ~ScopedRegion() {
    PopRegion(r_);
  }
  // Copying would pop the same region twice; forbid it.
  ScopedRegion(const ScopedRegion&) = delete;
  ScopedRegion& operator=(const ScopedRegion&) = delete;
};
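
PushRegion and PopRegion are not shown in the excerpt. A sketch of the per-thread, three-frame ring buffer that could sit behind them; the capacity and names here are assumptions, not quotes from the MiRiS source, and Region is the struct from profiler.hpp above:

cpp profiler_ring_sketch.cpp
#include <chrono>
#include <cstdint>

// Assumed capacity: enough Region records for roughly three frames on one thread.
constexpr size_t kRegionsPerThread = 3 * 4096;

struct ThreadRegionRing {
  Region   records[kRegionsPerThread];
  uint32_t head  = 0;   // next slot to write; wraps around
  uint16_t depth = 0;   // current nesting depth
};

thread_local ThreadRegionRing tRing;
thread_local uint16_t tThreadId;   // assigned at thread start (not shown)

static uint64_t NowTicks() {
  return std::chrono::steady_clock::now().time_since_epoch().count();
}

Region* PushRegion(const char* name) {
  Region& r   = tRing.records[tRing.head++ % kRegionsPerThread];
  r.name      = name;
  r.threadId  = tThreadId;
  r.depth     = tRing.depth++;
  r.startTick = NowTicks();
  r.endTick   = 0;   // open until the matching pop
  return &r;
}

void PopRegion(Region* r) {
  r->endTick = NowTicks();
  --tRing.depth;
}

In use, a region is just ScopedRegion region("physics/update"); at the top of the scope being measured; the destructor closes it when the scope exits.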

3.3 · The viewer

The viewer is a tiny WebGL canvas that takes a JSON dump and draws boxes.[3] It supports zoom, pan, and a search field. It does not support flame graphs, statistical roll-ups, call-graph reconstructions, or symbolic demangling. We tried all four. None of them helped us find a real bug.

What did help: the search field. Type a region name; non-matching regions fade to #e0e0e0. The hitched frame announces itself. The profiler does the looking; the engineer does the thinking.

3.4 · The one regret

We wrote a feature called “auto-instrument” that wraps every function in the engine in a scoped region at compile time. It seemed like a good idea. It produced 80,000 regions per frame. The viewer choked. The ring buffer choked. Our cache choked. We removed it three weeks later, and the morale of the team rose visibly.

Telemetry is a thing you choose to keep, not a thing you choose to collect. Anything you collect by default, you cannot read, by definition.

The lesson, if there is one: a profiler is a microscope, not a sieve. Point it at a thing. If you point it at everything, you have built a fog machine.


[3] The viewer source is in tools/profview/. It is roughly 1,200 lines, including the JSON loader. The two engine-side files are runtime/profiler.hpp and runtime/profiler.cpp. There are no other files. There never will be.

Colophon & corrections

Notes on the production of this issue, and amendments to previous ones.

Set in: IBM Plex Sans, IBM Plex Serif, IBM Plex Mono.
Composed on: A 13-inch laptop in a quiet room, mostly at night.
Built with: Static HTML, hand-written CSS, no frameworks.
Errata: Issue 10, §2.3: the figure stated 12% false positives. The correct figure is 6%. Thanks to E. Lindqvist for the catch.

— The Editor