Nako - The beginning

Date: 01.02.2022

Background

Nako is a signed distance function (SDF) framework that lets the user define such functions at runtime (in Rust). An accompanying renderer takes a function, uploads it to the GPU, and renders a frame from it.

Those functions can be combined freely to form finite or infinite fields. A nice example is the snail scene below.
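For readers new to SDFs, here is a minimal sketch of the core idea (using glam's Vec3; the helper names are illustrative, not Nako's API): a signed distance function returns, for any point in space, the distance to the nearest surface, and two such functions are combined into a union by simply taking the minimum of their distances.

use glam::Vec3;

// Distance to a sphere of radius `r` centered at the origin:
// negative inside, zero on the surface, positive outside.
fn sphere(p: Vec3, r: f32) -> f32 {
    p.length() - r
}

// Distance to an infinite horizontal plane at `height`.
fn plane(p: Vec3, height: f32) -> f32 {
    p.y - height
}

// The union of two SDFs is the minimum of their distances,
// which is what makes them so easy to combine freely.
fn scene(p: Vec3) -> f32 {
    sphere(p, 1.0).min(plane(p, -1.0))
}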

Signed distance functions are only one way to define geometry. Two other popular methods exist. The most widely used is the triangle-based representation: sets of triangles defined by points in space. These can be rendered efficiently on GPUs and have the advantage of a small memory footprint as well as mature art pipelines.

Another approach is voxels, essentially the 3D equivalent of 2D pixel graphics. Their advantages are trivial manipulation of the voxel data and the ability to represent volumetric data.

Signed distance fields combine aspects of the triangle- and voxel-based approaches: they offer a volumetric representation with a small memory footprint. It is also easy to define infinite fields, and no resolution is associated with the function.

The downsides are immature art pipelines and (as far as I know) the unsolved issue of “how to render dynamically defined SDFs on the GPU”. While scenes like the Shadertoy above showcase the power of SDFs, those are usually hard-coded into the shader, or a voxel volume is used at runtime to translate the SDF into a renderable structure. See Claybook and Dreams for two excellent presentations on how to render SDFs this way.

However, I wanted to try and find a new way that does not introduce the resolution or memory-footprint problems associated with big voxel volumes. This is the idea from which work on Nako started.

Developing Nako

The intended usage of Nako is the runtime definition of SDFs for 3D and 2D, as well as the ability to project any 2D SDF into 3D. While implementing the prototype I also added a compositor-like layering system to the renderer, which allows rendering different scenes to different locations on the screen.
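For the 2D-to-3D projection, one common construction is extrusion along an axis; whether Nako uses exactly this construction is an assumption on my part, but it illustrates the idea:

use glam::{Vec2, Vec3};

// Extrude a 2D SDF along the z-axis to a slab of the given half-height.
// The combination of max (inside distance) and length (outside corner
// distance) keeps the result an exact SDF.
fn extrude(p: Vec3, sdf_2d: impl Fn(Vec2) -> f32, half_height: f32) -> f32 {
    let d = sdf_2d(Vec2::new(p.x, p.y));
    let w = Vec2::new(d, p.z.abs() - half_height);
    w.x.max(w.y).min(0.0) + w.max(Vec2::ZERO).length()
}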

GPUs and stack machines

In practice, defining an SDF yields a hierarchy, or tree, of signed distance field operations that together make up a more or less complex model. To represent and evaluate this tree of operations, a stack machine is needed. This led to the first, CPU-based prototype of Nako, finished with commit #4e03b365.
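A minimal sketch of such a stack machine, with a hypothetical Op enum (Nako's actual byte code is more involved): the tree is serialized in post-order, leaves push distances, and operations pop their arguments and push the result.

use glam::Vec3;

enum Op {
    Sphere { radius: f32 }, // leaf: push distance to a sphere at the origin
    Plane { height: f32 },  // leaf: push distance to a horizontal plane
    Union,                  // pop two distances, push their minimum
}

fn eval(ops: &[Op], p: Vec3) -> f32 {
    let mut stack: Vec<f32> = Vec::new();
    for op in ops {
        match op {
            Op::Sphere { radius } => stack.push(p.length() - radius),
            Op::Plane { height } => stack.push(p.y - height),
            Op::Union => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a.min(b));
            }
        }
    }
    stack.pop().unwrap() // the distance from `p` to the whole model
}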

While implementing the prototype for the GPU, a problem became apparent that is already well known for stack-based tree traversal on GPUs:

Stacks do not perform well on the GPU.

Especially when tracking more information than just “distance to surface”, such as color or normal properties, the stack needs a lot of registers. On GPUs, however, registers are scarce. So I started to revise my plan of a GPU-based stack machine.

The following picture shows the tree for a simple scene. When traversing the tree and saving the intermediate results, the stack grows to a maximum of four values.

Tree stack allocation scheme

This tree, however, can be rearranged to consume less stack space, since the “union” operation on SDFs is commutative:

Optimized tree stack allocation scheme

This reduces the stack height to two elements. This optimization cannot be applied to non-commutative operations like subtraction or “finite repetition”. The big advantage, however, is that the “stack” never holds more than two elements, so the second element can be manipulated (set color, translate, etc.) and, when finished, combined with the “rest” of the scene.
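A sketch of the resulting two-element evaluation (with self-invented names, not Nako's API): with only unions, the scene degenerates into a flat list of leaves, where the first stack element accumulates the scene and the second, active element can be evaluated and modified in isolation.

use glam::Vec3;

struct Leaf {
    center: Vec3, // a translate modifier folded into the leaf
    radius: f32,
}

fn eval_scene(leaves: &[Leaf], p: Vec3) -> f32 {
    let mut rest = f32::MAX; // empty scene: everything is infinitely far away
    for leaf in leaves {
        // active element: evaluate and modify it in isolation...
        let active = (p - leaf.center).length() - leaf.radius;
        // ...then union it with the rest of the scene.
        rest = rest.min(active);
    }
    rest
}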

The concept of the PrimaryStream (the first stack element) and the SecondaryStream (the second, or active, stack element) was born. Nako gained some high-level primitives and could then be used to define SDFs at runtime like this:

stream = stream
        .push(
            SecondaryStream::new(Union, Plane { height: -100.0, normal: Vec3::Y })
                .push_mod(Color(Vec3::new(0.95, 0.8, 0.8)))
                .build(),
        )
        .push(
            SecondaryStream::new(Union, Cuboid { extent: walls })
                .push_mod(Translate(p))
                .push_mod(Color(color))
                .build(),
        )
        .push(
            SecondaryStream::new(
                Union,
                Cuboid { extent: Vec3::new(walls.z, walls.y, walls.x) },
            )
            .push_mod(Translate(
                p + Vec3::new(walls.x - walls.z, 0.0, walls.x + walls.z)
            ))
            .push_mod(Color(color))
            .build(),
        )
        //...
        .push(
            SecondaryStream::new(SmoothUnion(10.0), Sphere { radius: 25.0 })
                .push_mod(Translate(Vec3::new(-250.0, -85.0, 1.0)))
                .push_mod(Color(Vec3::new(0.2, 1.0, 0.2)))
                .build(),
        )
        .push(
            SecondaryStream::new(SmoothUnion(50.0), Sphere { radius: 30.0 })
                .push_mod(Translate(Vec3::new(-250.0, -100.0, 25.0)))
                .push_mod(Color(Vec3::new(0.75, 0.0, 0.2)))
                .build(),
        )
        .push(
            SecondaryStream::new(SmoothUnion(50.0), Sphere { radius: 45.0 })
                .push_mod(Translate(Vec3::new(-250.0, -100.0, 160.0)))
                .push_mod(Color(Vec3::new(0.5, 1.0, 1.0)))
                .build(),
        );

This concept worked well for a year. It is enough to create finite objects that can be rendered via sphere tracing, or even segment tracing with some hacks.
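As a refresher, sphere tracing marches a ray forward by the distance the SDF reports, which is always a safe step since no surface can be closer than that distance; a minimal sketch, not Nako's actual renderer:

use glam::Vec3;

// March along the ray until we are close enough to a surface, or give up.
// Steps shrink drastically near (but parallel to) surfaces, which is the
// "near surface" problem mentioned below.
fn sphere_trace(sdf: impl Fn(Vec3) -> f32, origin: Vec3, dir: Vec3) -> Option<Vec3> {
    let mut t = 0.0;
    for _ in 0..128 {
        let p = origin + dir * t;
        let d = sdf(p);
        if d < 1e-3 {
            return Some(p); // hit
        }
        t += d;
        if t > 1e4 {
            break; // ray left the scene without hitting anything
        }
    }
    None
}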

When combining more than ~25 primitives at a time, the runtime of the sphere-tracing pass became too long. After some debugging I found the byte-code buffer access to be the main source of the problem (next to the “near surface” problem of traditional sphere tracing).

The GPU interpreter looks more or less like this:

let mut word = 0;
while word < byte_code_len {
    let op = byte_code_buffer[word]; // <-- takes a long time
    handle_op(op);
    word += 1;
}

Especially with high thread divergence, loading different words from the byte code and executing different functions based on those words becomes slow.

Revising the base code

At this point I started to rethink my approach. The stream-based SDF definition had the advantage of a fixed-size stack, but in practice it lacks much of the power a tree-based SDF has (things like infinite repetition or wrapping combined primitives). Those shortcomings become especially apparent when trying to model infinite scenes.
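For illustration, here is what “infinite repetition” looks like as a plain function (a sketch, not Nako code): it wraps the evaluation point of an entire sub-tree into a repeating cell instead of combining distances, which is exactly what the flat stream model cannot express.

use glam::Vec3;

// Fold the evaluation point into a single cell of size `cell` on the
// x/z-plane; everything evaluated with the folded point repeats infinitely.
fn repeat_xz(p: Vec3, cell: f32) -> Vec3 {
    Vec3::new(
        (p.x + 0.5 * cell).rem_euclid(cell) - 0.5 * cell,
        p.y,
        (p.z + 0.5 * cell).rem_euclid(cell) - 0.5 * cell,
    )
}

// An infinite grid of spheres from a single primitive.
fn infinite_spheres(p: Vec3) -> f32 {
    repeat_xz(p, 100.0).length() - 25.0
}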

Obviously it has to be possible to evaluate tree-based SDFs fast on the GPU; the Shadertoy at the top is proof of that.

At this point it played into my hands that Embark is currently working on a Rust-to-SPIR-V compiler (rust-gpu). My shaders were already written in Rust and compiled with exactly that compiler, so the question was:

Can I somehow use the compiler at runtime to inline my SDF directly into the shader code?

This should, in theory, result in the same performance characteristics the Shadertoys have, since the SDF would be encoded in the shader code itself, instead of in a byte-code buffer that is interpreted at runtime.
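As a hypothetical sketch of that goal, the scene from above could be emitted as plain Rust and handed to the compiler, so the GPU executes straight-line code instead of interpreting a byte-code buffer:

use glam::Vec3;

// Hand-written stand-in for what generated, inlined SDF code could look like.
fn scene_sdf(p: Vec3) -> f32 {
    let plane = p.y + 100.0; // Plane { height: -100.0, normal: Vec3::Y }
    let sphere = (p - Vec3::new(-250.0, -85.0, 1.0)).length() - 25.0;
    plane.min(sphere) // union
}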