The poor man’s render graph

Date: 09.05.2022

Synchronization in Vulkan is hard

If you have written anything using one of the new low-level APIs, you will almost certainly have come across the pain of synchronizing data. As far as I understand, it is similarly difficult in DX12 as it is in Vulkan; Metal seems to be a little friendlier to the developer. When creating small applications where each frame submits more or less the same commands, it is enough to do this by hand. So you start writing command-buffer recording steps like this:

dev.cmd_pipeline_barrier(
  *cmd,
  ash::vk::PipelineStageFlags::TOP_OF_PIPE,
  ash::vk::PipelineStageFlags::COMPUTE_SHADER,
  ash::vk::DependencyFlags::empty(),
  &[], //mem
  &[], //buffer
  &[
    //Transition attachment image from UNDEFINED to GENERAL so the compute shader can write it
    ash::vk::ImageMemoryBarrier {
      image: image.inner,
      src_access_mask: ash::vk::AccessFlags::NONE,
      dst_access_mask: ash::vk::AccessFlags::SHADER_WRITE,
      old_layout: ash::vk::ImageLayout::UNDEFINED,
      new_layout: ash::vk::ImageLayout::GENERAL,
      subresource_range: image.subresource_all(),
      src_queue_family_index: queue_graphics_family,
      dst_queue_family_index: queue_graphics_family,
      ..Default::default()
    },
    //Move swapchain image to present src, since the later barrier will move it into transfer
    //dst assuming it was on present src khr.
    ash::vk::ImageMemoryBarrier {
      image: swimg.inner,
      src_access_mask: ash::vk::AccessFlags::NONE,
      dst_access_mask: ash::vk::AccessFlags::NONE,
      old_layout: ash::vk::ImageLayout::UNDEFINED,
      new_layout: ash::vk::ImageLayout::PRESENT_SRC_KHR,
      subresource_range: swimg.subresource_all(),
      src_queue_family_index: queue_graphics_family,
      dst_queue_family_index: queue_graphics_family,
      ..Default::default()
    },
  ],
)

This works until you decide that the commands, and therefore probably some buffer and image states, should differ from time to time, for instance when changing the shading based on some event, or when implementing some debugging output. It becomes even worse when you start using multiple queues, for instance async compute, or transfer queues for asynchronous data up- and download. At that point handling all the synchronization correctly becomes pretty hard.

A reason for the difficulty might be the three-layered synchronization that is common in Vulkan applications. You have PipelineBarriers for command-to-command synchronization, Semaphores for CommandBuffer-to-CommandBuffer synchronization, and Fences to synchronize your CPU-side program with whatever the GPU is currently doing. Timeline Semaphores bridge the gap between Semaphores and Fences, but I have not used them yet.
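
For reference, here is a rough sketch of how all three layers show up in a single raw ash submission (simplified; sem_acquire, sem_present, cmd, graphics_queue and execute_fence are placeholders for whatever your frame actually uses):

//The command buffer `cmd` already contains the pipeline barriers (layer 1) shown above.
let wait_sems = [sem_acquire];
let wait_stages = [ash::vk::PipelineStageFlags::COMPUTE_SHADER];
let cmds = [cmd];
let signal_sems = [sem_present];
let submit = ash::vk::SubmitInfo::builder()
  .wait_semaphores(&wait_sems)       //layer 2: wait on another submission (e.g. image acquire)
  .wait_dst_stage_mask(&wait_stages)
  .command_buffers(&cmds)
  .signal_semaphores(&signal_sems)   //layer 2: signal e.g. the presentation engine
  .build();
unsafe {
  dev.queue_submit(graphics_queue, &[submit], execute_fence).unwrap();
  //layer 3: let the CPU wait until the GPU has finished this submission
  dev.wait_for_fences(&[execute_fence], true, u64::MAX).unwrap();
}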

Especially when writing bigger applications, doing all this by hand is not feasible. That's why render graphs (or frame graphs) were invented to automate this kind of work. There are multiple implementations of varying complexity. My primary inspiration, however, is Kajiya, mostly because I like the less-code approach of its API. I did not want to spend multiple months coming up with a solution anyway, so my implementation has a helper-like character rather than being an all-in-one solution for frame management, submission etc.

API Overview

For the user there are two main parts.

  1. Graph
  2. Pass

The links always point to the most recent commit at the time of writing. Details of the implementation might change over time; for instance, Timeline Semaphores will probably be used at some point instead of Fences.

Graph

A Graph instance handles data reuse between graph submissions. This currently includes CommandBuffer reuse and Semaphore reuse. The graph can be used to create a new GraphBuilder that records multiple (sequential) Passes.

//Build graph and execute
let execute_fence = self
  .graph
  .record()
  .insert_pass(
    "ImageAcquireWait",
    &mut wait_image,
    graphics_queue.family_index,
  )
  .insert_pass(
    "ComputePass",
    &mut self.frame_data[swimage.index as usize].compute_pass,
    graphics_queue.family_index,
  )
  .insert_pass("SwapchainBlit", &mut blit, graphics_queue.family_index)
  .insert_pass(
    "SwapchainPrepare",
    &mut present_prepare,
    graphics_queue.family_index,
  )
  .finish()
  .execute()
  .unwrap();

Theoretically, the resulting graph could be optimized for certain behavior before submission; at the time of writing this is not implemented though. Execution immediately follows the finishing of the graph.

The returned ExecutionFence is a fat fence guarding all submitted command buffers, as well as keeping all used resources alive (images, buffers, descriptor sets etc.; basically every Vulkan object that has a create and a destroy function).
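
Conceptually, you can picture it like this (illustrative only, not the actual type definition):

//Illustrative sketch: a fence plus everything that must outlive the submitted work.
struct ExecutionFence {
  fence: ash::vk::Fence,
  //Arc-wrapped resources referenced by the submitted command buffers; they are kept
  //alive at least until the fence is waited on and the ExecutionFence is dropped.
  keep_alive: Vec<std::sync::Arc<dyn std::any::Any + Send + Sync>>,
}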

Pass

The other important bit are the Passes. A pass can be seen as a single self-contained process, for instance rendering a GBuffer, creating shadow maps, or simply copying one image to another. It makes sense to have common passes implemented already (like blitting one image to another, buffer copies etc.) and only let the developer implement passes that need deeper knowledge of the renderer in use. In practice, for my small render-graph example only the compute-shader submission pass is implemented by hand; everything else can be built from passes already implemented in the command-graph crate.

Each pass declares a set of AssumedStates that are read while building the graph. In practice, again for the render-graph example, setting up the dependencies looks like this:

//setup wait pass
let mut wait_image = WaitExternal::new(swimage.sem_acquire.clone());
//Setup compute pass
self.frame_data[swimage.index as usize]
  .compute_pass
  .push_const(push);

//setup image blit (blits final image to swapchain image) and prepare pass
let mut blit = ImageBlit::new(
  self.frame_data[swimage.index as usize]
    .compute_pass
    .target_image
    .clone(),
  st_swimage.clone(),
);
//setup prepare including the semaphore that is signaled once the pass has finished.
let mut present_prepare = SwapchainPrepare::new(st_swimage, swimage.sem_present.clone());

Note that the user can choose to either create a pass per frame or cache a pass for multiple submissions. Depending on what the pass does, either one can make sense.

Advantages of the slim approach

As you can see, the user still has to declare data dependencies by hand, but this is now as easy as cloning resources into the correct pass. Transitions and synchronization across multiple queues are handled by the graph.

I decided against a blackbox-like graph (where all data is managed by the graph). The main advantage of this more transparent approach is that the developer can choose, for instance, to write some parts by hand and only let the graph handle common work. Or the other way around: hand-optimize critical paths and let the graph handle only swapchain image submission and async compute.

As mentioned above, the user can also choose to create passes each time or to implement caching, depending on the workload. This gives the freedom to implement different strategies for how to use screen buffers, for instance (one GBuffer per swapchain image, or one GBuffer and waiting for the swapchain present to complete before reusing it?).

This freedom, however, comes with the assumption by the graph that the state of each resource is changed accordingly. If not, undefined behaviour might occur (and, if enabled, you will see a lot of errors from the validation layers).

Resource state handling

A detail I did not explain yet is the resource state handling. In essence, all create/destroy-able objects are wrapped in an Arc pointer, so an image is always an Arc<Image>, for instance. This allows keeping them alive until they are dropped by the user's code AND all command buffers. Images and buffers are additionally associated with a state by wrapping them in StImage or StBuffer (St for state… naming is hard!).
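
Roughly, you can imagine the state-tracked wrapper like this (illustrative field names, not the crate's exact definition):

//Illustrative only: the actual StImage tracks more (and differently named) fields.
struct StImage {
  image: std::sync::Arc<Image>,    //Arc-wrapped image, kept alive by every clone
  layout: ash::vk::ImageLayout,    //last known image layout
  access: ash::vk::AccessFlags,    //last access mask
  queue_family: Option<u32>,       //queue family that currently owns the image, if any
}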

As mentioned before, a pass declares all the state it assumes a resource to be in via its assumed_state implementation. Therefore the graph always knows both the current and the wanted state of each resource. Finding the correct transition is then done by analyzing the context of the transition (do we see the resource for the first time? Is it initialized or undefined? Is it on another queue? etc.).
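
A hypothetical sketch of what a pass exposes to the graph (simplified; the actual trait and state types in the crate differ in detail):

//Simplified, hypothetical shapes: a pass tells the graph which state it expects
//its resources to be in, and records its commands once the graph has set that up.
enum AssumedState {
  Image {
    image: StImage,
    layout: ash::vk::ImageLayout,
    access: ash::vk::AccessFlags,
    stage: ash::vk::PipelineStageFlags,
  },
  Buffer {
    buffer: StBuffer,
    access: ash::vk::AccessFlags,
    stage: ash::vk::PipelineStageFlags,
  },
}

trait Pass {
  ///Read while building the graph to derive barriers and queue transitions.
  fn assumed_states(&self) -> Vec<AssumedState>;
  ///Record the pass' commands into an already-begun command buffer.
  fn record(&mut self, device: &ash::Device, cmd: ash::vk::CommandBuffer);
}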

The current implementation can distinguish between

  1. uninitialized state
  2. inherited state (from another submission)
  3. in-graph intermediate state

It can happen that inherited state is not usable. For instance, if a buffer was left on queue 1 without a release operation, we cannot correctly acquire the buffer for an operation on queue 0. In that case the buffer is reinitialized for queue 0, which invalidates the old data. Apart from this case, all other data can be transitioned correctly between queues.
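
Sketched as an enum, the distinction looks roughly like this (names invented for illustration):

//Illustrative only: how the graph could categorize a resource's previous state.
enum ResourceOrigin {
  //Never used before: the old layout can be treated as UNDEFINED.
  Uninitialized,
  //Left over from an earlier graph submission, owned by some queue family.
  //If it was left without a release operation, it cannot be acquired elsewhere
  //and has to be reinitialized instead.
  Inherited { queue_family: u32, released: bool },
  //Produced earlier within the graph currently being built.
  InGraph,
}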

Backend Overview

After talking about the user perspective, it is time to explain how the actual graph is built. I tried three different approaches and settled on this one since it is simple to implement and creates reasonably optimal graphs. It also allows for an optimization stage before submitting.

Canned approaches

First

First Rendergraph

The first and simplest idea was to collect all Passes, check their initial resource states, and build an initial acquire phase that transitions the resources to the correct queue. The passes could then be executed one after another with minimal pipeline barriers. At the end all states are released for the next graph.

This approach has two problems though.

  1. It does not handle multiple queues.
  2. Acquire and release operations need to know from which queue to which queue a resource is transferred.

This queue-family-based ordering of passes is still used in the final solution, but, as I'll explain later, it is wrapped by a queue-transition graph.

Second

Second Rendergraph

While the first approach allowed for sequential submission of the graph, it was not fit to handle multiple queues. The next iteration solved this by defining the execution queue while submitting. Whenever a queue transition was needed, the user could define a Sync for the resource, which would move queue ownership to the other queue.

This was the first actually working prototype. Sadly I squashed the render-graph commit, otherwise I could have linked it :/. But the API looked something like this:

let graph_fence = Graph::build()
  .insert_pass("Gbuffer", .., graphics_queue)
  .insert_pass("ShadowPass", .., async_compute)
  .move_to_queue(shadow_image, async_compute, graphics_queue)
  .insert_pass("Light", .., graphics_queue)
  ..
  .build();

As you can see, the user still has to keep track of resource transitions, just in a higher-level way, and that is exactly what I wanted to prevent. The next and final iteration was therefore a mixture of the first and the second approach.

Current graph building

The current approach uses the first graph type on a per-queue basis. I call those sub-streams Segments. Each segment contains an acquire phase, a set of passes with in-segment pipeline barriers, and finally a release phase.

Queue Segment
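
In code, a segment could be pictured roughly like this (field names are illustrative, not the crate's actual ones):

//Illustrative sketch of a per-queue-family segment.
struct Segment {
  queue_family: u32,
  //Ownership acquires and layout transitions executed at the start of the segment.
  acquires: Vec<ash::vk::ImageMemoryBarrier>,
  //The passes of this segment; pipeline barriers between them stay inside the segment.
  passes: Vec<Box<dyn Pass>>,
  //Ownership releases executed at the end, handing resources over to other queues.
  releases: Vec<ash::vk::ImageMemoryBarrier>,
}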

Building the graph works by simultaneously tracking the segments for each queue family. Whenever a new pass is inserted, all needed state is checked against its current state. If an inter-queue dependency is found (meaning a resource is needed on a different queue than the one it currently resides on), the segments of both queues (the from and the to queue's segments) are finished. This means an acquire phase is built for each dependency of a segment, and a release phase is built for each dependee. The process is hopefully explained below:

Graph Building

In practice the release operations are delayed as much as possible and the acquire operations happen as early as possible. This allows the graph to collect multiple queue transitions in one place. The queue transitions are now found while inserting a pass, so no explicit move_to_queue call is needed anymore.
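
To make the segment-closing part of that process concrete, here is a small, self-contained toy model (all names invented; it only shows when segments get closed, while the real builder of course records actual passes, barriers and semaphores):

use std::collections::HashMap;

//Toy model of the per-queue segment tracking described above.
#[derive(Default)]
struct ToyBuilder {
  //Which queue family currently owns each resource (identified by some id).
  owner: HashMap<u32, u32>,
  //Finished segments in submission order, as (queue_family, passes) pairs.
  finished: Vec<(u32, Vec<&'static str>)>,
  //Segment currently being recorded per queue family.
  open: HashMap<u32, Vec<&'static str>>,
}

impl ToyBuilder {
  fn insert_pass(&mut self, name: &'static str, queue: u32, used_resources: &[u32]) {
    for &res in used_resources {
      match self.owner.get(&res).copied() {
        //Resource lives on another queue: finish both open segments so a release
        //(on `from`) and an acquire (on `queue`) can be emitted between them.
        Some(from) if from != queue => {
          self.finish_segment(from);
          self.finish_segment(queue);
        }
        //Same queue or first use: only an in-segment barrier / initialization is needed.
        _ => {}
      }
      self.owner.insert(res, queue);
    }
    self.open.entry(queue).or_default().push(name);
  }

  fn finish_segment(&mut self, queue: u32) {
    if let Some(passes) = self.open.remove(&queue) {
      if !passes.is_empty() {
        self.finished.push((queue, passes));
      }
    }
  }
}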

Closing

For now this simple graph seems to work nicely for smaller applications. Obviously, more complex implementations allow for much more sophisticated resource handling, especially if the resources are handled by the graph directly instead of from outside the graph. Things like temporary images or reusing buffers in a different context have to be done by hand in my version. But the main goal of simplifying layout transitions and inter-queue synchronization is achieved nevertheless. I am currently building a small library of useful passes, like synchronized buffer/image upload to the GPU and download of data, as well as general-purpose passes like tone mapping or depth-based single-pass blur. Those will be merged into MarpII's main branch at some point. The shaders will hopefully be released as separate rust-gpu crates.

As always, if you have suggestions, contact me on one of the channels listed on the index page.