Supported Features

This page is used for tracking Cargo/Rust and CUDA features that are currently supported or planned to be supported in the future. As well as tracking some information about how they could be supported.

Note that Not supported does not mean it won't ever be supported, it just means we haven't gotten around to adding it yet.

Indicator	Meaning
➖	Not Applicable
❌	Not Supported
✔️	Fully Supported
🟨	Partially Supported

Rust Features

Feature Name	Support Level	Notes
Opt-Levels	✔️	behaves mostly the same (because llvm is still used for optimizations). Except that libnvvm opts are run on anything except no-opts because nvvm only has -O0 and -O3
codegen-units	✔️
LTO	➖	we load bitcode modules lazily using dependency graphs, which then forms a single module optimized by libnvvm, so all the benefits of LTO are on without pre-libnvvm LTO being needed.
Closures	✔️
Enums	✔️
Loops	✔️
If	✔️
Match	✔️
Proc Macros	✔️
Try (`?`)	✔️
128 bit integers	🟨	Basic ops should work (and are emulated), advanced intrinsics like `ctpop`, `rotate`, etc are unsupported.
Unions	✔️
Iterators	✔️
Dynamic Dispatch	✔️
Pointer Casts	✔️
Unsized Slices	✔️
Alloc	✔️
Printing	✔️
Panicking	✔️	Currently just traps (aborts) because of weird printing failures in the panic handler
Float Ops	✔️	Maps to libdevice intrinsics, calls to libm are not intercepted though, which we may want to do in the future
Atomics	❌

CUDA Libraries

Library Name	Support Level	Notes
CUDA Runtime API	➖	The CUDA Runtime API is for CUDA C++, we use the driver API
CUDA Driver API	🟨	Most functions are implemented, but there is still a lot left to wrap because it is gigantic
cuBLAS	❌	In-progress
cuFFT	❌
cuSOLVER	❌
cuRAND	➖	cuRAND only works with the runtime API, we have our own general purpose GPU rand library called `gpu_rand`
cuDNN	❌	In-progress
cuSPARSE	❌
AmgX	❌
cuTENSOR	❌
OptiX	🟨	CPU OptiX is mostly complete, GPU OptiX is still heavily in-progress because it needs support from the codegen

GPU-side Features

Note: Most of these categories are used very rarely in CUDA code, therefore do not be alarmed that it seems like many things are not supported. We just focus on things used by the wide majority of users.

Feature Name	Support Level	Notes
Function Execution Space Specifiers	➖
Variable Memory Space Specifiers	✔️	Handled Implicitly but can be explicitly stated for statics with `#[address_space(...)]`
Built-in Vector Types	➖	Use linear algebra libraries like vek or glam
Built-in Variables	✔️
Memory Fence Instructions	✔️
Synchronization Functions	✔️
Mathematical Functions	🟨	Less common functions like native f16 math are not supported
Texture Functions	❌
Surface Functions	❌
Read-Only Data Cache Load Function	❌	No real need, immutable references hint this automatically
Load Functions Using Cache Hints	❌
Store Functions Using Cache Hints	❌
Time Function	✔️
Atomic Functions	❌
Address Space Predicate Functions	✔️	Address Spaces are implicitly handled, but they may be added for exotic interop with CUDA C/C++
Address Space Conversion Functions	✔️
Alloca Function	➖
Compiler Optimization Hint Functions	➖	Existing `core` hints work
Warp Vote Functions	❌
Warp Match Functions	❌
Warp Reduce Functions	❌
Warp Shuffle Functions	❌
Nanosleep	✔️
Warp Matrix Functions (Tensor Cores)	❌
Asynchronous Barrier	❌
Asynchronous Data Copies	❌
Profiler Counter Function	✔️
Assertion	✔️
Trap Function	✔️
Breakpoint	✔️
Formatted Output	✔️
Dynamic Global Memory Allocation	✔️
Execution Configuration	✔️
Launch Bounds	❌
Pragma Unroll	❌
SIMD Video Instructions	❌
Cooperative Groups	❌
Dynamic Parallelism	❌
Stream Ordered Memory	✔️
Graph Memory Nodes	❌
Unified Memory	✔️
`__restrict__`	➖	Not needed, you get that performance boost automatically through rust's noalias :)

Frequently Asked Questions

This page will cover a lot of the questions people often have when they encounter this project, so they are addressed all at once.

Why not use rustc with the LLVM PTX backend?

Good question, a good amount of reasons:

The LLVM PTX backend is still very much WIP and often doesn't have things and/or breaks.
Due to odd dylib issues, the LLVM PTX backend does not work on windows, it will fail to link in intrinsics. This can be circumvented by building LLVM in a special way, but this is far beyond what most users will do to get a backend to work.
NVVM is used in NVCC itself, therefore NVIDIA is much more receptive to bugs inside of it.
NVVM contains proprietary optimizations (which is why it's closed source) that are simply not present in the LLVM PTX backend which yield considerable performance differences (especially on more complex kernels with more information in the IR).
For some reason (either rustc giving weird LLVM IR or the LLVM PTX backend being broken) the LLVM PTX backend often generates completely invalid PTX for trivial programs, so it is not an acceptable workflow for a production pipeline.
GPU and CPU codegen is fundamentally different, creating a codegen that is only for the GPU allows us to seamlessly implement features which would have been impossible or very difficult to implement in the existing codegen, such as:
- Shared memory, this requires some special generation of globals with custom addrspaces, its just not possible to do without backend explicit handling.
- Custom linking logic to do dead code elimination so as to not end up with large PTX files full of dead functions/globals.
- Stripping away everything we do not need, no complex ABI handling, no shared lib handling, control over how function calls are generated, etc.

So overall, the LLVM PTX backend is fit for smaller kernels/projects/proofs of concept. It is however not fit for compiling an entire language (core is very big) with dependencies and more. The end goal is for rust to be able to be used over CUDA C/C++ with the same (or better!) performance and features, therefore, we must take advantage of all optimizations NVCC has over us.

If NVVM IR is a subset of LLVM IR, can we not give rustc-generated LLVM IR to NVVM?

Short answer, no.

Long answer, there are a couple of things that make this impossible:

At the time of writing, libnvvm expects LLVM 7 bitcode, giving it LLVM 12/13 bitcode (which is what rustc uses) does not work.
NVVM IR is a subset of LLVM IR, there are tons of things that nvvm will not accept. Such as a lot of function attrs not being allowed. This is well documented and you can find the spec here. Not to mention many bugs in libnvvm that i have found along the way, the most infuriating of which is nvvm not accepting integer types that arent i1, i8, i16, i32, or i64. This required special handling in the codegen to convert these "irregular" types into vector types.

What is the point of using Rust if a lot of things in kernels are unsafe?

This is probably the most asked question by far, so let's break it down in detail.

TL;DR There are things we fundamentally can't check, but just because that is the case does not mean we cannot still prevent a lot of problems we can check.

Yes it is true that GPU kernels have much more unsafe than CPU code usually, but why is that?

The reason is that CUDA's entire model is not based on safety in any way, there are almost zero safety nets in CUDA. Rust is the polar opposite of this model, everything is safe unless there are some invariants that cannot be checked by the compiler. Let's take a look at some of the invariants we face here.

Take this program as an example, written in CUDA C++:

__global__ void kernel(int* buf, int* other)
{
  int idx = threadIdx.x;
  buf[idx] = other[idx];
}

int main(void)
{
  int N = 50;
  int* a, b, d_a, d_b;
  a = (int*)malloc(N*sizeof(int));
  b = (int*)malloc(N*sizeof(int));

  cudaMalloc(&d_a, N*sizeof(int));
  cudaMalloc(&d_b, N*sizeof(int));

  for (int i = 0; i < N; i++) {
    a[i] = 0.0f;
    b[i] = 2.0f;
  }

  cudaMemcpy(d_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

  kernel<<<1, N>>>(d_a, d_b);

  cudaMemcpy(d_a, a, N*sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(d_b, b, N*sizeof(float), cudaMemcpyDeviceToHost);

  /* do something with the data */

  cudaFree(d_a);
  cudaFree(d_b);
  free(a);
  free(b);
}

You may think this looks innocent enough, it's a very easy and understandable program. But if you really think about it, this is a minefield of things that could go wrong. Let's list most of them:

buf could be too small, that is undefined behavior (reading beyond allocated memory)
similarly, other could also be too small.
The kernel could have been called with too many or not enough parameters.
The kernel could have been called with a different grid/block dimension than expected, which would cause a data race.
Any of the cudaMalloc, cudaMemcpy, kernel launches, or cudaFree calls could have errored, which we dont handle and simply ignore.
We could have forgotten to initialize the buffers.
We could have forgotten to free the buffers.

This goes to show that CUDA C/C++ and CUDA overall rely on shifting the burden of correctness from the API to the developer. However, Rust uses a completely opposite design model, the compiler verifies as much as it can, and burden is only shifted to the developer if its absolutely essential, behind unsafe.

This creates a big problem for us, it is very difficult (and sometimes impossible) to prove correctness statically when wrapping how CUDA works. We can solve a lot of the points using things like RAII and providing a high level wrapper, but we fundamentally cannot prove a lot of things, the most common place where this is shown is the CPU-GPU boundary, e.g. launching kernels.

Firstly, we cannot verify that the PTX we are calling is sound, that it has no data races, writes into the right buffers, doesnt rely on undocumented invariants, and does not write invalid data to buffers. This already makes launching kernels perma-unsafe.

Second, CUDA does zero validation in terms of kernel parameter mismatch, it will simply segfault on you, or even keep going but produce invalid data (or cause the kernel to cause undefined behavior). This is a design flaw in CUDA itself, we have no control over it and no 100% reliable way to fix it, therefore we must shift this burden of correctness to the developer.

Moreover, the CUDA GPU kernel model is entirely based on trust, trusting each thread to index into the correct place in buffers, trusting the caller of the kernel to uphold some dimension invariants, etc. This is once again, completely incompatible with how rust does things. We can provide wrappers to calculate an index that always works, and macros to index a buffer automatically, but indexing in complex ways is a core operation in CUDA and it is impossible for us to prove that whatever the developer is doing is correct.

Finally, We would love to be able to use mut refs in kernel parameters, but this is would be unsound. Because each kernel function is technically called multiple times in parallel with the same parameters, we would be aliasing the mutable ref, which Rustc declares as unsound (aliasing mechanics). So raw pointers or slightly-less-unsafe need to be used. However, they are usually only used for the initial buffer indexing, after which you can turn them into a mutable reference just fine (because you indexed in a way where no other thread will index that element). Also note that shared refs can be used as parameters just fine.

Now that we outlined why this is a thing, why is using rust a benefit if we still need to use unsafe?

Well it's simple, eliminating most of the things that a developer needs to think about to have a safe program is still exponentially safer than leaving everything to the developer to think about.

By using rust, we eliminate:

The forgotten/unhandled CUDA errors problem (yay results!).
The uninitialized memory problem.
The forgetting to dealloc memory problem.
All of the inherent C++ problems in the kernel beyond the initial buffer indexing.
The mismatched grid/block dimension problem (by providing thread::index).
The forgetting to memcpy data back problem.

And countless other problems with things like graphs, streams, devices, etc.

So, just because we cannot solve every problem with CUDA safety, does not mean we cannot solve a lot of them, and ease the burden of correctness from the developer.

Besides, using Rust only adds to safety, it does not make CUDA more unsafe. This means there are only things to gain in terms of safety using Rust.

Why not use rust-gpu with compute shaders?

The reasoning for this is the same reasoning as to why you would use CUDA over opengl/vulkan compute shaders:

CUDA usually outperforms shaders if kernels are written well and launch configurations are optimal.
CUDA has many useful features such as shared memory, unified memory, graphs, fine grained thread control, streams, the PTX ISA, etc.
rust-gpu does not perform many optimizations, and with cg_ssa's less than ideal codegen, the optimizations by llvm and libnvvm are needed.
SPIRV is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed. While libnvvm (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified.
rust-gpu is primarily focused on graphical shaders, compute shaders are secondary, which the rust ecosystem needs, but it also needs a project 100% focused on computing, and computing only.
SPIRV cannot access many useful CUDA libraries such as Optix, cuDNN, cuBLAS, etc.
SPIRV debug info is still very young and rust-gpu cannot generate it. While rustc_codegen_nvvm does, which can be used for profiling kernels in something like nsight compute.

Moreover, CUDA is the primary tool used in big computing industries such as VFX and scientific computing. Therefore it is much easier for CUDA C++ users to use rust for GPU computing if most of the concepts are still the same. Plus, we can interface with existing CUDA code by compiling it to PTX then linking it with our rust code using the CUDA linker API (which is exposed in a high level wrapper in cust).

Why use the CUDA Driver API over the Runtime API?

Simply put, the driver API provides better control over concurrency, context, and module management, and overall has better performance control than the runtime API.

Let's break it down into the main new concepts introduced in the Driver API.

Contexts

The first big difference in the driver API is that CUDA context management is explicit and not implicit.

Contexts are similar to CPU processes, they manage all of the resources, streams, allocations, etc associated with operations done inside them.

The driver API provides control over these contexts. You can create new contexts and drop them at any time. As opposed to the runtime API which works off of an implicit context destroyed on device reset. This causes a problem for larger applications because a new integration of CUDA could call device reset when it is finished, which causes further uses of CUDA to fail.

Modules

Modules are the second big difference in the driver API. Modules are similar to shared libraries, they contain all of the globals and functions (kernels) inside of a PTX/cubin file. The driver API is language-agnostic, it purely works off of ptx/cubin files. To answer why this is important we need to cover what cubins and ptx files are briefly.

PTX is a low level assembly-like language which is the penultimate step before what the GPU actually executes. It is human-readable and you can dump it from a CUDA C++ program with nvcc ./file.cu --ptx. This PTX is then optimized and lowered into a final format called SASS (Source and Assembly) and turned into a cubin (CUDA binary) file.

Driver API modules can be loaded as either ptx, cubin, or fatbin files. If they are loaded as ptx then the driver API will JIT compile the PTX to cubin then cache it. You can also compile ptx to cubin yourself using ptx-compiler and cache it.

This pipeline provides much better control over what functions you actually need to load and cache. You can separate different functions into different modules you can load dynamically (and even dynamically reload). This can yield considerable performance benefits when dealing with a lot of functions.

Streams

Streams are (one of) CUDA's way of dispatching multiple kernels in parallel. You can kind of think of them as OS threads essentially. Kernels dispatched one after the other inside of a particular stream will execute one after the other on the GPU, which is helpful for kernels that rely on a previous kernel's result.

The CUDA runtime API operates off of a single global stream. This causes a lot of issues for users of large programs or libraries that need to manage many kernels being dispatched at the same time as efficiently as possible.

Why target NVIDIA GPUs only instead of using something that can work on AMD?

This is a complex issue with many arguments for both sides, so i will give you both sides as well as my opinion.

Pros for using OpenCL over CUDA:

OpenCL (mostly) works on everything because it is a specification, not an actual centralized tool.
OpenCL will be decently fast on most systems.

Cons for using OpenCL over CUDA:

Just like all open specifications, not every implementation is as good or supports the same things. Just because the absolute basics work, does not mean more exotic features work on everything because some vendors may lag behind others.
OpenCL is slow to add new features, this is a natural consequence of being an open specification many vendors need to implement. For example, OpenCL 3.0 (which was announced in around April 2020) is supported by basically nobody. NVIDIA cards support OpenCL 2.0 while AMD cards support OpenCL 2.1. This means new features cannot be reliably relied upon because they are unlikely to work on a lot of cards for a LONG time.
OpenCL can only be written in OpenCL C (based on C99), OpenCL C++ is a thing, but again, not everything supports it. This makes complex programs more difficult to create.
OpenCL has less tools and libraries.
OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off of an assembly format (ptx) and debug info. Essentially how CPU code works. This makes writing language-agnostic things in OpenCL near impossible and locks you into using OpenCL C.
OpenCL is plagued with serious driver bugs which have not been fixed, or that occur only on certain vendors.

Pros for using CUDA over OpenCL:

CUDA is for the most part the industry-standard tool for "higher level" computing such as scientific or VFX computing.
CUDA is a proprietary tool, meaning that NVIDIA is able to push out bug fixes and features much faster than releasing a new spec and waiting for vendors to implement it. This allows for more features being added, such as cooperative kernels, cuda graphs, unified memory, new profilers, etc.
CUDA is a single entity, meaning that if something does or does not work on one system it is unlikely that that will be different on another system. Assuming you are not using different architectures, where one gpu may be lacking a feature.
CUDA is usually 10-30% faster than OpenCL overall, this is likely due to subpar OpenCL drivers by NVIDIA, but it is unlikely this performance gap will change in the near future.
CUDA has a much richer set of libraries and tools than OpenCL, such as cuFFT, cuBLAS, cuRand, cuDNN, OptiX, NSight Compute, cuFile, etc.
You can seamlessly use existing CUDA C/C++ code with cust or rustc_codegen_nvvm-generated PTX by using the CUDA linker APIs which are exposed in cust. Allowing for incremental switching to Rust.
There is a generally larger set of code samples in CUDA C/C++ over OpenCL.
Documentation is far better, there are (mostly) complete API docs for every single CUDA library and function out there.
CUDA generally offers more control over the internals of how CUDA executes your GPU code. For example, you can choose to keep PTX which uses a virtual architecture, or you can compile that to cubin (SASS) and cache that for faster load times.

Cons for using CUDA over OpenCL:

CUDA only works on NVIDIA GPUs.

What makes cust and RustaCUDA different?

Cust is a fork of rustacuda which changes a lot of things inside of it, as well as adds new features that are not inside of rustacuda.

The most significant changes (This list is not complete!!) are:

Drop code no longer panics on failure to drop raw CUDA handles, this is so that InvalidAddress errors, which cause CUDA to nuke the driver and nuke any memory allocations no longer cause piles of panics from device boxes trying to be dropped when returning from the function with ?.
cuda-sys is no longer used, instead, we have our own bindings cust_raw so we can ensure updates to the latest CUDA features.
CUDA occupancy functions have been added.
PTX linking functions have been added.
Native support for vek linear algebra types for grid/block dimensions and DeviceCopy has been added under the vek feature.
Util traits have been added.
Basic graph support has been added.
Some functions have been renamed.
Some functions have been added.

Changes that are currently in progress but not done/experimental:

Surfaces
Textures
Graphs
PTX validation

Just like rustacuda, cust makes no assumptions of what language was used to generate the ptx/cubin. It could be C, C++, futhark, or best of all, Rust!

Cust's name is literally just rust + cuda mashed together in a horrible way. Or you can pretend it stands for custard if you really like custard.

Guide

Getting Started

This section covers how to get started writing GPU crates with cuda_std and cuda_builder.

Required Libraries

Before you can use the project to write GPU crates, you will need a couple of prerequisites:

The CUDA SDK, version 11.2-11.8 (and the appropriate driver - see cuda release notes).
- We recently added experimental support for the 12.x SDK, please file any issues you see
This is only for building GPU crates, to execute built PTX you only need CUDA 9+.
LLVM 7.x (7.0 to 7.4), The codegen searches multiple places for LLVM:
- If LLVM_CONFIG is present, it will use that path as llvm-config.
- Or, if llvm-config is present as a binary, it will use that, assuming that llvm-config --version returns 7.x.x.
- Finally, if neither are present or unusable, it will attempt to download and use prebuilt LLVM. This currently only works on Windows however.
The OptiX SDK if using the optix library (the pathtracer example uses it for denoising).
You may also need to add libnvvm to PATH, the builder should do it for you but in case it does not work, add libnvvm to PATH, it should be somewhere like CUDA_ROOT/nvvm/bin,
You may wish to use or consult the bundled Dockerfile to assist in your local config

rust-toolchain

Currently, the Codegen only works on nightly (because it uses rustc internals), and it only works on a specific version of nightly. This is why you must copy the rust-toolchain file in the project repository to your own project. This will ensure you are on the correct nightly version so the codegen builds.

Only the codegen requires nightly, cust and other CPU-side libraries work perfectly fine on stable.

Cargo.toml

Now we can actually get started creating our GPU crate 🎉

Start by making a normal crate as you normally would, manually or with cargo init: cargo init name --lib.

After this, we just need to add a couple of things to our Cargo.toml:

[package]
name = "name"
version = "0.1.0"
edition = "2021"

+[lib]
+crate-type = ["cdylib", "rlib"]

[dependencies]
+cuda_std = "XX"

Where XX is the latest version of cuda_std.

We changed our crate's crate types to cdylib and rlib. We specified cdylib because the nvptx targets do not support binary crate types. rlib is so that we will be able to use the crate as a dependency, such as if we would like to use it on the CPU.

lib.rs

Before we can write any GPU kernels, we must add a few directives to our lib.rs which are required by the codegen:

#![cfg_attr(
    target_os = "cuda",
    no_std,
    register_attr(nvvm_internal)
)]

use cuda_std::*;

This does a couple of things:

It only applies the attributes if we are compiling the crate for the GPU (target_os = "cuda").
It declares the crate to be no_std on CUDA targets.
It registers a special attribute required by the codegen for things like figuring out what functions are GPU kernels.
It explicitly includes kernel macro and thread

If you would like to use alloc or things like printing from GPU kernels (which requires alloc) then you need to declare alloc too:

extern crate alloc;

Finally, if you would like to use types such as slices or arrays inside of GPU kernels you must allow improper_cytypes_definitions either on the whole crate or the individual GPU kernels. This is because on the CPU, such types are not guaranteed to be passed a certain way, so they should not be used in extern "C" functions (which is what kernels are implicitly declared as). However, rustc_codegen_nvvm guarantees the way in which things like structs, slices, and arrays are passed. See Kernel ABI.

#![allow(improper_ctypes_definitions)]

Writing our first GPU kernel

Now we can finally start writing an actual GPU kernel.

Expand this section if you are not familiar with how GPU-side CUDA works

Firstly, we must explain a couple of things about GPU kernels, specifically, how they are executed. GPU Kernels (functions) are the entry point for executing anything on the GPU, they are the functions which will be executed from the CPU. GPU kernels do not return anything, they write their data to buffers passed into them.

CUDA's execution model is very very complex and it is unrealistic to explain all of it in this section, but the TLDR of it is that CUDA will execute the GPU kernel once on every thread, with the number of threads being decided by the caller (the CPU).

We call these parameters the launch dimensions of the kernel. Launch dimensions are split up into two basic concepts:

Threads, a single thread executes the GPU kernel once, and it makes the index of itself available to the kernel through special registers (functions in our case).
Blocks, Blocks house multiple threads that they execute on their own. Thread indices are only unique across the thread's block, therefore CUDA also exposes the index of the current block.

One important thing to note is that block and thread dimensions may be 1d, 2d, or 3d. That is to say, i can launch 1 block of 6x6x6, 6x6, or 6 threads. I could also launch 5x5x5 blocks. This is very useful for 2d/3d applications because it makes the 2d/3d index calculations much simpler. CUDA exposes thread and block indices for each dimension through special registers. We expose thread index queries through cuda_std::thread.

Now that we know how GPU functions work, let's write a simple kernel. We will write a kernel which does [1, 2, 3, 4] + [1, 2, 3, 4] = [2, 4, 6, 8]. We will use a 1-dimensional index and use the cuda_std::thread::index_1d utility method to calculate a globally-unique thread index for us (this index is only unique if the kernel was launched with a 1d launch config!).

#[kernel]
pub unsafe fn add(a: &[f32], b: &[f32], c: *mut f32) {
    let idx = thread::index_1d() as usize;
    if idx < a.len() {
        let elem = &mut *c.add(idx);
        *elem = a[idx] + b[idx];
    }
}

If you have used CUDA C++ before, this should seem fairly familiar, with a few oddities:

Kernel functions must be unsafe currently, this is because the semantics of Rust safety on the GPU are still very much undecided. This restriction will probably be removed in the future.
We use *mut f32 and not &mut [f32]. This is because using &mut in function arguments is unsound. The reason being that rustc assumes &mut does not alias. However, because every thread gets a copy of the arguments, this would cause it to alias, thereby violating this invariant and yielding technically unsound code. Pointers do not have such an invariant on the other hand. Therefore, we use a pointer and only make a mutable reference once we are sure the elements are disjoint: let elem = &mut *c.add(idx);.
We check that the index is not out of bounds before doing anything, this is because it is common to launch kernels with thread amounts that are not exactly divisible by the length for optimization.

Internally what this does is it first checks that a couple of things are right in the kernel:

All parameters are Copy.
The function is unsafe.
The function does not return anything.

Then it declares this kernel to the codegen so that the codegen can tell CUDA this is a GPU kernel. It also applies #[no_mangle] so the name of the kernel is the same as it is declared in the code.

Building the GPU crate

Now that you have some kernels defined in a crate, you can build them easily using cuda_builder. cuda_builder is a helper crate similar to spirv_builder (if you have used rust-gpu before), it builds GPU crates while passing everything needed by rustc.

To use it you can simply add it as a build dependency in your CPU crate (the crate running the GPU kernels):

+[build-dependencies]
+cuda_builder = "XX"

Where XX is the current version of cuda_builder.

Then, you can simply invoke it in the build.rs of your CPU crate:

use cuda_builder::CudaBuilder;

fn main() {
    CudaBuilder::new("path/to/gpu/crate/root")
        .copy_to("some/path.ptx")
        .build()
        .unwrap();
}

The first argument is the path to the root of the GPU crate you are trying to build, which would probably be ../name in our case. The second function .copy_to(path) tells the builder to copy the built PTX file somewhere. By default the builder puts the PTX file inside of target/cuda-builder/nvptx64-nvidia-cuda/release/crate_name.ptx, but it is usually helpful to copy it to another path, which is what such method does. Finally, build() actually runs rustc to compile the crate. This may take a while since it needs to build things like core from scratch, but after the first compile, incremental will make it much faster.

You can also specify a different compute capability with .arch():

CudaBuilder::new("path/to/gpu/crate")
    .arch(cuda_builder::NvvmArch::Compute75)  // Target compute 7.5 GPUs
    .copy_to("kernel.ptx")
    .build()
    .unwrap();

The architecture you choose affects which GPU features are available. See the Compute Capability Gating guide for details on writing code that adapts to different GPU capabilities.

Finally, you can include the PTX as a static string in your program:

static PTX: &str = include_str!("some/path.ptx");

Then execute it using cust.

Don't forget to include the current rust-toolchain in the top of your project:

# If you see this, run `rustup self update` to get rustup 1.23 or newer.

# NOTE: above comment is for older `rustup` (before TOML support was added),
# which will treat the first line as the toolchain name, and therefore show it
# to the user in the error, instead of "error: invalid channel name '[toolchain]'".

[toolchain]
channel = "nightly-2021-12-04"
components = ["rust-src", "rustc-dev", "llvm-tools-preview"]

Docker

There is also a Dockerfile prepared as a quickstart with all the necessary libraries for base cuda development.

You can use it as follows (assuming your clone of Rust-CUDA is at the absolute path RUST_CUDA):

Ensure you have Docker setup to use gpus
Build docker build -t rust-cuda $RUST_CUDA
Run docker run -it --gpus all -v $RUST_CUDA:/root/rust-cuda --entrypoint /bin/bash rust-cuda
- Running will drop you into the container's shell and you will find the project at ~/rust-cuda
If all is well, you'll be able to cargo run in ~/rust-cuda/examples/cuda/cpu/add

Notes:

refer to rust-toolchain to ensure you are using the correct toolchain in your project.
despite using Docker, your machine will still need to be running a compatible driver, in this case for Cuda 11.4.1 it is >=470.57.02
if you have issues within the container, it can help to start ensuring your gpu is recognized
- ensure nvidia-smi provides meaningful output in the container
- NVidia provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try makeing and running the deviceQuery sample. If all is well you should see many details about your gpu

Compute Capability Gating

This section covers how to write code that adapts to different CUDA compute capabilities using conditional compilation.

What are Compute Capabilities?

CUDA GPUs have different "compute capabilities" that determine which features they support. Each capability is identified by a version number like 3.5, 5.0, 6.1, 7.5, etc. Higher numbers generally mean more features are available.

For example:

Compute capability 5.0+ supports 64-bit integer min/max and bitwise atomic operations
Compute capability 6.0+ supports double-precision (f64) atomic operations
Compute capability 7.0+ supports tensor core operations

For comprehensive details, see NVIDIA's CUDA documentation on GPU architectures.

Virtual vs Real Architectures

In CUDA terminology:

Virtual architectures (compute_XX) define the PTX instruction set and available features
Real architectures (sm_XX) represent actual GPU hardware

rust-cuda works exclusively with virtual architectures since it only generates PTX. The NvvmArch::ComputeXX enum values correspond to CUDA's virtual architectures.

Using Target Features

When building your kernel, the NvvmArch::ComputeXX variant you choose enables specific target_feature flags. These can be used with #[cfg(...)] to conditionally compile code based on the capabilities of the target GPU.

For example, this checks whether the target architecture supports running compute 6.0 code or newer:

#![allow(unused)]
fn main() {
#[cfg(target_feature = "compute_60")]
}

Think of it as asking: “Is the GPU I’m building for at least compute 6.0?” Depending on which NvvmArch::ComputeXX is used to build the kernel, there is a different answer:

Building for Compute60 → ✓ Yes (exact match)
Building for Compute70 → ✓ Yes (7.0 GPUs support 6.0 code)
Building for Compute50 → ✗ No (5.0 GPUs can't run 6.0 code)

These features let you write optimized code paths for specific GPU generations while still supporting older ones.

Specifying Compute Capabilites

Starting with CUDA 12.9, NVIDIA introduced architecture suffixes that affect compatibility.

Base Architecture (No Suffix)

Example: NvvmArch::Compute70

This is everything mentioned above, and was the only option in CUDA 12.8 and lower.

When to use: Default choice for maximum compatibility.

Example usage:

#![allow(unused)]
fn main() {
// In build.rs
CudaBuilder::new("kernels")
    .arch(NvvmArch::Compute70)
    .build()
    .unwrap();

// In your kernel code:
#[cfg(target_feature = "compute_60")]  // ✓ Pass (older compute capability)
#[cfg(target_feature = "compute_70")]  // ✓ Pass (current compute capability)
#[cfg(target_feature = "compute_80")]  // ✗ Fail (newer compute capability)
}

Family Suffix ('f')

Example: NvvmArch::Compute101f

Specifies code compatible with the same major compute capability version and with an equal or higher minor compute capability version.

When to use: When you need features from a specific minor version but want forward compatibility within the family.

Example usage:

#![allow(unused)]
fn main() {
// In build.rs
CudaBuilder::new("kernels")
    .arch(NvvmArch::Compute101f)
    .build()
    .unwrap();

// In your kernel code:
#[cfg(target_feature = "compute_100")]   // ✗ Fail (10.0 < 10.1)
#[cfg(target_feature = "compute_101")]   // ✓ Pass (equal major, equal minor)
#[cfg(target_feature = "compute_103")]   // ✓ Pass (equal major, greater minor)
#[cfg(target_feature = "compute_101f")]  // ✓ Pass (the 'f' variant itself)
#[cfg(target_feature = "compute_100f")]  // ✗ Fail (other 'f' variant)
#[cfg(target_feature = "compute_90")]    // ✗ Fail (different major)
#[cfg(target_feature = "compute_110")]   // ✗ Fail (different major)
}

Architecture Suffix ('a')

Example: NvvmArch::Compute100a

Specifies code that only runs on GPUs of that specific compute capability and no others. However, during compilation, it enables all available instructions for the architecture, including all base variants up to the same version and all family variants with the same major version and equal or lower minor version.

When to use: When you need to use architecture-specific features (like certain Tensor Core operations) that are only available on that exact GPU model.

Example usage:

#![allow(unused)]
fn main() {
// In build.rs
CudaBuilder::new("kernels")
    .arch(NvvmArch::Compute100a)
    .build()
    .unwrap();

// In your kernel code:
#[cfg(target_feature = "compute_100a")]  // ✓ Pass (the 'a' variant itself)
#[cfg(target_feature = "compute_100")]   // ✓ Pass (base variant)
#[cfg(target_feature = "compute_90")]    // ✓ Pass (lower base variant)
#[cfg(target_feature = "compute_100f")]  // ✓ Pass (family variant with same major/minor)
#[cfg(target_feature = "compute_101f")]  // ✗ Fail (family variant with higher minor)
#[cfg(target_feature = "compute_110")]   // ✗ Fail (higher major version)
}

Note: While the 'a' variant enables all these features during compilation (allowing you to use all available instructions), the generated PTX code will still only run on the exact GPU architecture specified.

For more details on suffixes, see NVIDIA's blog post on family-specific architecture features.

Manual Compilation (Without CudaBuilder)

If you're invoking rustc directly instead of using cuda_builder, you only need to specify the architecture through LLVM args:

rustc --target nvptx64-nvidia-cuda \
    -C llvm-args=-arch=compute_61 \
    -Z codegen-backend=/path/to/librustc_codegen_nvvm.so \
    ...

Or with cargo:

export RUSTFLAGS="-C llvm-args=-arch=compute_61 -Z codegen-backend=/path/to/librustc_codegen_nvvm.so"
cargo build --target nvptx64-nvidia-cuda

The codegen backend automatically synthesizes target features based on the architecture type as described above.

Common Patterns for Base Architectures

These patterns work when using base architectures (no suffix), which enable all lower capabilities:

At Least a Capability (Default)

#![allow(unused)]
fn main() {
// Code that requires compute 6.0 or higher
#[cfg(target_feature = "compute_60")]
{
    cuda_std::atomic::atomic_add(data, 1.0); // f64 atomics need 6.0+
}
}

Exactly One Capability

#![allow(unused)]
fn main() {
// Code that targets exactly compute 6.1 (not 6.2+)
#[cfg(all(target_feature = "compute_61", not(target_feature = "compute_62")))]
{
    // Features specific to compute 6.1
}
}

Up To a Maximum Capability

#![allow(unused)]
fn main() {
// Code that works up to compute 6.0 (not 6.1+)
#[cfg(all(target_feature = "compute_35", not(target_feature = "compute_61")))]
{
    // Maximum compatibility implementation
}
}

Targeting Specific Architecture Ranges

#![allow(unused)]
fn main() {
// This block compiles when building for architectures >= 6.0 but < 8.0
#[cfg(all(target_feature = "compute_60", not(target_feature = "compute_80")))]
{
    // Code here can use features from 6.0+ but must not use 8.0+ features
}
}

Debugging Capability Issues

If you encounter errors about missing functions or features:

Check the compute capability you're targeting in cuda_builder
Verify your GPU supports the features you're using
Use nvidia-smi to check your GPU's compute capability
Add appropriate #[cfg] guards or increase the target architecture

Runtime Behavior

Again, rust-cuda only generates PTX, not pre-compiled GPU binaries ("fatbinaries"). This PTX is then JIT-compiled by the CUDA driver at runtime.

For more details, see NVIDIA's documentation on GPU compilation and JIT compilation.

Tips

This section contains some tips on what to do and what not to do using the project.

GPU kernels

Generally don't derive Debug for structs in GPU crates. The codegen currently does not do much global DCE (dead code elimination) so debug can really slow down compile times and make the PTX gigantic. This will get much better in the future but currently it will cause some undesirable effects.
Don't use recursion, CUDA allows it but threads have very limited stacks (local memory) and stack overflows yield confusing InvalidAddress errors. If you are getting such an error, run the executable in cuda-memcheck, it should yield a write failure to Local memory at an address of about 16mb. You can also put the ptx file through cuobjdump and it should yield ptxas warnings for functions without a statically known stack usage.

Kernel ABI

This section details how parameters are passed to GPU kernels by the Codegen at the current time. In other words, how the codegen expects you to pass different types to GPU kernels from the CPU.

⚠️ If you find any bugs in the ABI please report them. ⚠️

Preface

Please note that the following only applies to non-rust call conventions, we make zero guarantees about the rust call convention, just like rustc.

While we currently override every ABI except rust, you should generally only use "C", any other ABI we override purely to avoid footguns.

Functions marked as #[kernel] are enforced to be extern "C" by the kernel macro, and it is expected that all GPU kernels be extern "C", not that you should be declaring any kernels without the #[kernel] macro, because the codegen/cuda_std is allowed to rely on the behavior of #[kernel] for correctness.

Structs

Structs are always passed directly using byte arrays if they are passed by value in the function. This corresponds to what is expected by CUDA/the PTX ABI.

For example:

#[derive(Clone, Copy)]
#[repr(C)]
pub struct Foo {
    pub a: u16,
    pub b: u64,
    pub c: u128,
}

#[kernel]
pub unsafe fn kernel(a: Foo) {
    /* ... */
}

will map to the following PTX:

.visible .entry kernel(
	.param .align 16 .b8 kernel_param_0[32]
)

Consequently, it is expected that you will pass the struct by value when launching the kernel, and not by reference (by allocating a device box):

let foo = Foo { 
  a: 5,
  b: 6,
  c: 7
};

unsafe {
  launch!(
    module.kernel<<<1, 1, 0, stream>>>(foo)
  )?;
}

And not

let foo = DeviceBox::new(Foo { 
  a: 5,
  b: 6,
  c: 7
});

unsafe {
  launch!(
    module.kernel<<<1, 1, 0, stream>>>(foo.as_device_ptr())
  )?;
}

Arrays

Arrays are passed the same as if they were structs, they are always passed by value as byte arrays.

Slices

Slices are passed as two parameters, both 32-bit on nvptx or 64-bit on nvptx64. The first parameter is the pointer to the beginning of the data, and the second parameter is the length of the slice.

For example:

#[kernel]
pub unsafe fn kernel(a: &[u8]) {
  /* ... */
}

Will map to the following PTX (on nvptx64):

.visible .entry kernel(
	.param .u64 kernel_param_0,
	.param .u64 kernel_param_1
)

Consequently, it is expected that you will pass the pointer and the length as multiple parameters when calling the kernel:

let mut buf = [5u8; 10].as_dbuf()?;

unsafe {
  launch!(
    module.kernel<<<1, 1, 0, stream>>>(buf.as_device_ptr(), buf.len())
  )?;
}

You may get warnings about slices being an improper C-type, but the warnings are safe to ignore, the codegen guarantees that slices are passed as pairs of params.

You cannot however pass mutable slices, this is because it would violate aliasing rules, each thread receiving a copy of the mutable slice would violate aliasing rules. You may use a &[UnsafeCell<T>] then convert an element to a mutable ref (once you know the element accesses are disjoint), or more commonly, use a raw pointer.

ZSTs

ZSTs (zero-sized types) are ignored and become nothing in the final PTX.

Primitives

Primitive types are passed directly by value, same as structs. They map to the special PTX types .s8, .s16, .s32, .s64, .u8, .u16, .u32, .u64, .f32, and .f64. With the exception that u128 and i128 are passed as byte arrays (but this has no impact on how they are passed from the CPU).

References And Pointers

References and Pointers are both passed as expected, as pointers. It is therefore expected that you pass such parameters using device memory:

#[kernel]
pub unsafe fn kernel(a: &u8) {
  /* ... */
}

let mut val = DeviceBox::new(&5)?;

unsafe {
  launch!(
    module.kernel<<<1, 1, 0, stream>>>(val.as_device_ptr())
  )?;
}

repr(Rust) Types

using repr(Rust) types inside of kernels is not disallowed but it is highly discouraged. This is because rustc is allowed to switch up how the types are represented across compiler invocations which leads to hard to track errors.

Therefore, you should generally only use repr(C) inside of kernel parameters. With the exception of slices that have a guaranteed parameter layout.

Safety

With one of Rust's main foci being memory safety, we strive to make most things safe, without requiring too much unsafe usage and mental checks from the user. However, CUDA's inherent thread/memory model leaves many things ambiguous as to whether they are sound and makes many invariants inherently impossible to statically prove. In this section we will talk about what kinds of behavior is considered undefined inside of kernels as well as the invariants that must be upheld by the caller of kernels.

⚠️ This list is not fully complete, as the semantics of Rust safety on the GPU have not been explored much, however, this list includes most of the actions "regular" users may commit ⚠️

Behavior considered undefined in GPU kernels

This list is purely meant to be a guide on what behavior should be avoided.

Undefined behavior on the GPU is defined as potentially being able to cause the following (but not limited to):

Unknown/Undefined data being written to a location in memory.
Causing fatal termination of either just the kernel (through trapping), or the entire CUDA driver (through invalid address errors).
Causing LLVM/NVVM to optimize the code into unknown code.

Behavior considered undefined inside of GPU kernels:

Most importantly, any behavior that is considered undefined on the CPU, is considered undefined on the GPU too. See: https://doc.rust-lang.org/reference/behavior-considered-undefined.html. The only exception being invalid sizes for buffers given to a GPU kernel.

Currently we declare that the invariant that a buffer given to a gpu kernel must be large enough for any access the kernel is going to make is up to the caller of the kernel to uphold. This idiom may be changed in the future.

Any kind of data race, this has the same semantics as data races in CPU code. Such as:
- Multiple threads writing to a location in memory at the same time without synchronization.
- One or more threads reading while a thread is writing to a memory location.
- Reading shared memory while a thread is writing to the location (if for example thread::sync has not been called).

Behavior not currently considered undefined, but considered undesirable:

calling thread::sync inside of a branch that not all threads inside of the thread block have reached.

Behavior considered undefined on the CPU

This list will contain behavior that is considered undefined in the context of actually launching GPU kernels from the CPU.

Streams

Streams will always execute concurrently with eachother. That is to say, kernels launched inside of a single stream guarantee that they will be executed one after the other, in order.

However, kernels launched in different streams have no guarantee of execution order, their execution may be interleaved and kernels are likely to be launched concurrently on the GPU.

Therefore, it is undefined behavior to write to the same memory location in kernels executed in different streams without synchronization.

For example: 1: Foo is allocated as a buffer of memory on the GPU. 2: Stream 1 launches kernel bar which writes to Foo. 3: Stream 2 launches kernel bar which also writes to Foo.

This is undefined behavior because the kernels are likely to be executed concurrently, causing a data race when multiple kernels try to write to the same memory.

However, if the thread that Stream 2 is located on calls synchronize() on Stream 1 before launching the kernel, this will be sound. Because synchronize() waits for Stream 2 to finish all of its tasks before giving back control to the calling thread.

Another important detail is that GPU operations on a stream are NOT synchronized with the CPU. This means that CPU code may not rely on a kernel being finished without calling synchronize(). For example:

launch!(module.bar<<<1, 1, 0, stream>>>(foo.as_unified_ptr()))?;
// 'bar' is not guaranteed to be finished executing at this point.
function_that_accesses_foo(foo);
stream.synchronize()?;
// foo may be accessed and will see the changes that 'bar' wrote to 'foo'. 'bar' is guaranteed 
// to be finished executing.

Contexts

Contexts are akin to CPU processes, therefore, it is undefined behavior (although it should always yield an invalid address error) to access another context's allocated GPU memory.

However, this is very uncommon because single-device code should not need multiple contexts generally. This only becomes relevant when using multiple devices (multi-GPU code) with different contexts.

Note however, that unified memory can be accessed by multiple GPUs and multiple contexts at the same time, as unified memory takes care of copying and moving data automatically from GPUs/CPU when a page fault occurs. For this reason as well as general ease of use, we suggest that unified memory generally be used over regular device memory.

Kernel Launches

Kernel Launches are the most unsafe part of CUDA, many things must be checked by the developer to soundly launch a kernel. It is fundamentally impossible for us to verify a large portion of the invariants expected by the kernel/CUDA.

The following invariants must be upheld by the caller of a kernel, failure to do so is undefined behavior:

The number of parameters passed to the kernel must match the expected number of parameters.
The dimensionality expected by the kernel must match, e.g. if the kernel expects 2d thread indices, it is undefined behavior to launch the kernel with 3d thread indices (which would cause a data race). However, it is not undefined behavior to launch the kernel with a dimensionality lower than expected, e.g. launching a 2d kernel with a 1d dimensionality.
The types expected by the kernel must match:
- If the kernel expects a struct, if the struct is repr(Rust), the struct must be the actual struct from the kernel library, otherwise, if it is repr(C) (which is reccomended), the fields must all match, including alignment and order of fields.
Reference aliasing rules must not be violated, including:
- Immutable references are allowed to be aliased, e.g. if a kernel expects &T and &T, it is sound to pass the same pointer for both.
- Data behind an immutable reference must not be modified, meaning, it is undefined behavior to pass the same pointer to &T and *mut T, where *mut T is used for modifying the data.
- Parameters such as &[UnsafeCell<T>] must be exclusive, assuming the kernel uses the UnsafeCell<T> to modify the data.
- *mut T does not necessarily need to follow aliasing rules, it is sound to pass the same pointer to two *mut T parameters assuming that the kernel accesses nonoverlapping regions of the memory. If a mutable reference is formed from the pointer, the mutable reference must be exclusive, e.g. it is undefined behavior for two threads to create a mutable reference to the same element in a pointer.
Any buffers passed to the kernel must be large enough for the size that the kernel expects. Allocated buffer size being correct for what the kernel expects is up to the caller, not the kernel.
Not allocating enough dynamic shared memory for how much the kernel expects.

Behavior that is not considered undefined but is undesirable:

Launching a kernel with more threads than expected by its launch bounds (.maxntid in PTX). This will cause the launch to fail.
Launching a kernel with a different number of threads than expected by its launch bounds (.reqntid in PTX). This will also cause the launch to fail.

The CUDA Toolkit

The CUDA Toolkit is an ecosystem for executing extremely fast code on NVIDIA GPUs for the purpose of general computing.

CUDA includes many libraries for this purpose, including the Driver API, Runtime API, the PTX ISA, libnvvm, etc. CUDA is currently the best option for computing in terms of libraries and control available, however, it unfortunately only works on NVIDIA GPUs.

This section will cover some of the general uses of GPU computing, why use CUDA, and general CUDA principles. This section will not cover everything about CUDA and it is not meant to. You can check out the official CUDA guide for a complete overview.

GPU Computing

You probably already know what GPU computing is, but if you don't, it is utilizing the extremely parallel nature of GPUs for purposes other than rendering. It is widely used in many scientific and consumer fields. Some of the most common uses being fluid/smoke simulation, protein folding, physically based rendering, cryptocurrency mining, AI model training, etc.

GPUs excel at tasks that do mostly the same thing every time, and need to do it millions of times. They do not excel at so-called "divergent" tasks, tasks where each run of the task may take different amounts of time and/or take different code paths.

Why CUDA?

CUDA is currently one of the best choices for fast GPU computing for multiple reasons:

It offers deep control over how kernels are dispatched and how memory is managed.
It has a rich ecosystem of tutorials, guides, and libraries such as cuRand, cuBlas, libnvvm, optix, the PTX ISA, etc.
It is mostly unmatched in performance because it is solely meant for computing and offers rich control. And more...

However, CUDA can only run on NVIDIA GPUs, which precludes AMD gpus from tools that use it. However, this is a drawback that is acceptable by many because of the significant developer cost of supporting both NVIDIA gpus with CUDA and AMD gpus with OpenCL, since OpenCL is generally slower, clunkier, and lacks libraries and docs on par with CUDA.

Why Rust?

Rust is a great choice for GPU programming, however, it has needed a kickstart, which is what rustc_codegen_nvvm tries to accomplish; The initial hurdle of getting Rust to compile to something CUDA can run is over, now comes the design and polish part.

On top of its rich language features (macros, enums, traits, proc macros, great errors, etc), Rust's safety guarantees can be applied in gpu programming too; A field that has historically been full of implied invariants and unsafety, such as (but not limited to):

Expecting some amount of dynamic shared memory from the caller.
Expecting a certain layout for thread blocks/threads.
Manually handling the indexing of data, leaving code prone to data races if not managed correctly.
Forgetting to free memory, using uninitialized memory, etc.

Not to mention the standardized tooling that makes the building, documentation, sharing, and linting of gpu kernel libraries easily possible. Most of the reasons for using rust on the CPU apply to using Rust for the GPU, these reasons have been stated countless times so i will not repeat them here.

A couple of particular rust features make writing CUDA code much easier: RAII and Results. In cust everything uses RAII (through Drop impls) to manage freeing memory and returning handles, which frees users from having to think about that, which yields safer, more reliable code.

Results are particularly helpful, almost every single call in every CUDA library returns a status code in the form of a cuda result. Ignoring these statuses is very dangerous and can often lead to random segfaults and overall unrealiable code. For this purpose, both the CUDA SDK, and other libraries provide macros to handle such statuses. This handling is not very reliable and causes dependency issues down the line.

Instead of an unreliable system of macros, we can leverage rust results for this. In cust we return special CudaResult<T> results that can be bubbled up using rust's ? operator, or, similar to CUDA_SAFE_CALL can be unwrapped or expected if proper error handling is not needed.

The CUDA Pipeline

As you may already know, "traditional" cuda is usually in the form of CUDA C/C++ files which use .cu extension. These files can be compiled using NVCC (NVIDIA CUDA Compiler) into an executable.

CUDA files consist of device and host functions. device functions are functions that run on the GPU, also called kernels. host functions run on the CPU and usually include logic on how to allocate GPU memory and call device functions.

However, a lot goes on behind the scenes that most people don't know about, a lot of it is integral to how rustc_codegen_nvvm works so we will briefly go over it.

Stages

The NVIDIA CUDA Compiler consists of distinct stages of compilation:

![NVCC]

NVCC separates device and host functions and compiles them separately. Most importantly, device functions are compiled to LLVM IR, and then the LLVM IR is fed to a library called libnvvm.

libnvvm is a closed source library which takes in a subset of LLVM IR, it optimizes it further, then it turns it into the next and most important stage of compilation, the PTX ISA.

PTX is a low level, assembly-like format with an open specification which can be targeted by any language.

We won't dig deep into what happens after PTX, but in essence, it is turned into a final format called SASS which is register allocated and is finally sent to the GPU to execute.

libnvvm

The stage/library we are most interested in is libnvvm. libnvvm is a closed source library that is distributed in every download of the CUDA SDK. Libnvvm takes a format called NVVM IR, it optimizes it, and converts it to a single PTX file you can run on NVIDIA GPUs using the driver or runtime API.

NVVM IR is a subset of LLVM IR, that is to say, it is a version of LLVM IR with restrictions. A couple of examples being:

Many intrinsics are unsupported
"Irregular" integer types such as i4 or i111 are unsupported and will segfault (however in theory they should be supported)
Global names cannot include ..
Some linkage types are not supported.
Function ABIs are ignored, everything uses the PTX calling convention.

You can find the full specification of the NVVM IR here if you are interested.

Special PTX features

As far as an assembly format goes, PTX is fairly user friendly for a couple of reasons:

It is well formatted.
It is mostly fully specified (other than the iffy grammar specification).
It uses named registers/parameters
It uses virtual registers (since gpus have thousands of registers, listing all of them out would be unrealistic).
It uses ASCII as a file encoding.

Technical

This section will cover the more technical details of how rustc_codegen_nvvm works as well as the issues that came with it.

It will also explain some technical details about CUDA/PTX/etc, it is not necessarily limited to rustc_codegen_nvvm.

Basic knowledge of how rustc and LLVM work and what they do is assumed. You can find info about rustc in the rustc dev guide.

Custom Rustc Backends

Before we get into the details of rustc_codegen_nvvm, we obviously need to explain what a codegen is!

Custom codegens are rustc's answer to "well what if i want rust to compile to X?". This is a problem that comes up in many situations, especially conversations of "well LLVM cannot target this, so we are screwed". To solve this problem, rustc decided to incrementally decouple itself from being attached/reliant on LLVM exclusively.

Previously, rustc only had a single codegen, the LLVM codegen. The LLVM codegen translated MIR directly to LLVM IR. This is great if you just want to support LLVM, but LLVM is not perfect, and inevitably you will hit limits to what LLVM is able to do. Or, you may just want to stop using LLVM, LLVM is not without problems (it is often slow, clunky to deal with, and does not support a lot of targets).

Nowadays, Rustc is almost fully decoupled from LLVM and it is instead generic over the "codegen" backend used. Rustc instead uses a system of codegen backends that implement traits and then get loaded as dynamically linked libraries. This allows rust to compile to virtually anything with a surprisingly small amount of work. At the time of writing, there are five publicly known codegens that exist:

rustc_codegen_clif, cranelift
rustc_codegen_llvm
rustc_codegen_gcc
rustc_codegen_spirv
rustc_codegen_nvvm, obviously the best codegen ;)

rustc_codegen_clif targets the cranelift backend, which is a codegen backend written in rust that is faster than LLVM but does not have many optimizations compared to LLVM. rustc_codegen_llvm is obvious, it is the backend almost everybody uses which targets LLVM. rustc_codegen_gcc targets GCC (GNU Compiler Collection) which is able to target more exotic targets than LLVM, especially for embedded. rustc_codegen_spirv targets the SPIR-V (Standard Portable Intermediate Representation 5) format, which is a format mostly used for compiling shader languages such as GLSL or WGSL to a standard representation that Vulkan/OpenGL can use, the reasons why SPIR-V is not an alternative to CUDA/rustc_codegen_nvvm have been covered in the FAQ.

Finally, we come to the star of the show, rustc_codegen_nvvm. This backend targets NVVM IR for compiling rust to gpu kernels that can be run by CUDA. What NVVM IR/libnvvm are has been covered in the CUDA section.

rustc_codegen_ssa

Despite its name, rustc_codegen_ssa does not actually codegen to anything, it is however the central crate behind every single codegen. The SSA codegen does most of the hard work in codegen, which is actually codegenning MIR and taking care of managing codegen altogether.

The SSA codegen abstracts away the MIR lowering logic so that custom codegens do not have to implement the time consuming logic of lowering MIR, they can just implement a bunch of traits and the SSA codegen does everything else.

The SSA codegen is literally just a bunch of traits, for example:

A trait for getting a type like an integer type
A trait for optimizing a module
A trait for linking everything
A trait for declaring a function ...etc You will find an SSA codegen trait in almost every single file.

rustc_codegen_nvvm

At the highest level, our codegen workflow goes like this:

Source code -> Typechecking -> MIR -> SSA Codegen -> LLVM IR (NVVM IR) -> PTX -> PTX opts/function DCE -> Final PTX
               |                                     |                  |      |                                  ^
               |                                     |          libnvvm +------+                                  |
               |                                     |                                                            |
               |                  rustc_codegen_nvvm +------------------------------------------------------------|
         Rustc +---------------------------------------------------------------------------------------------------

Before we do anything, rustc does its normal job, it typechecks, converts everything to MIR, etc. Then, rustc loads our codegen shared lib and invokes it to codegen the MIR. It creates an instance of NvvmCodegenBackend and it invokes codegen_crate. You could do anything inside codegen_crate but we just defer back to rustc_codegen_ssa and tell it to do the job for us:

fn codegen_crate<'tcx>(
    &self,
    tcx: TyCtxt<'tcx>,
    metadata: EncodedMetadata,
    need_metadata_module: bool,
) -> Box<dyn std::any::Any> {
    Box::new(rustc_codegen_ssa::base::codegen_crate(
        NvvmCodegenBackend,
        tcx,
        String::new(),
        metadata,
        need_metadata_module,
    ))
}

After that, the codegen logic is kind of abstracted away from us, which is a good thing! We just need to provide the SSA codegen whatever it needs to do its thing. This is done in the form of traits, lots and lots and lots of traits, more traits than you've ever seen, traits your subconscious has warned you of in nightmares, anyways. Because talking about how the SSA codegen works is kind of useless, we will instead talk first about general concepts and terminology, then dive into each trait.

But first, let's talk about the end of the codegen, it is pretty simple, we do a couple of things: after codegen is done and LLVM has been run to optimize each module

1: We gather every llvm bitcode module we created.
2: We create a new libnvvm program.
3: We add every bitcode module to the libnvvm program.
4: We try to find libdevice and add it to the program (see nvidia docs on what libdevice is).
5: We run the verifier on the nvvm program just to check that we did not create any invalid nvvm ir.
6: We run the compiler which gives us a final PTX string, hooray!
7: Finally, the PTX goes through a small stage where its parsed and function DCE is run to eliminate Most of the bloat in the file, traditionally this is done by the linker but theres no linker to be found for miles here.
8: We write this ptx file to wherever rustc tells us to write the final file.

We will cover the libnvvm steps in more detail later on.

Codegen Units (CGUs)

Ah codegen units, the thing everyone just tells you to set to 1 in Cargo.toml, but what are they? Well, to put it simply, codegen units are rustc splitting up a crate into different modules to then run LLVM in parallel over. For example, rustc can run LLVM over two different modules in parallel and save time.

This gets a little bit more complex with generics, because MIR is not monomorphized and monomorphized MIR is not a thing, the codegen monomorphizes instances on the fly. Therefore rustc needs to put any generic functions that one CGU relies on inside of the same CGU because it needs to monomorphize them.

Rlibs

rlibs are mysterious files, their origins are mysterious and their contents are the deepest layer of the iceberg. Just kidding, but rlibs often confuse people (including me at first). Rlibs are rustc's way of encoding basically everything it needs to know about a crate into a file. Rlibs usually contain the following:

Object files for each CGU.
LLVM Bitcode.
a Symbol table.
metadata:
- rustc version (because things can go kaboom if version mismatches, ABIs are fun amirite)
- A crate hash
- a crate id
- info about the source files
- the exported API, things like macros, traits, etc.
- MIR, for things such as generic functions and #[inline]d functions (please don't put #[inline] on everything, rustc will cry)

Types

Types! who doesn't love types, especially those that cause libnvvm to randomly segfault or loop forever! Anyways, types are an integral part of the codegen and everything revolves around them and you will see them everywhere.

rustc_codegen_ssa does not actually tell you what your type representation should be, it allows you to decide. For example, rust-gpu represents it as a SpirvType enum, while both cg_llvm and our codegen represent it as opaque llvm types:

type Type = &'ll llvm::Type;

llvm::Type is an opaque type that comes from llvm-c. 'll is one of the main lifetimes you will see throughout the whole codegen, it is used for anything that lasts as long as the current usage of llvm. LLVM gives you back pointers when you ask for a type or value, some time ago cg_llvm fully switched to using references over pointers, and we follow in their footsteps.

One important fact about types is that they are opaque, you cannot take a type and ask "is this X struct?", this is like asking "which chickens were responsible for this omelette?". You can ask if its a number type, a vector type, a void type, etc.

The SSA codegen needs to ask the backend for types for everything it needs to codegen MIR. It does this using a trait called BaseTypeMethods:

pub trait BaseTypeMethods<'tcx>: Backend<'tcx> {
    fn type_i1(&self) -> Self::Type;
    fn type_i8(&self) -> Self::Type;
    fn type_i16(&self) -> Self::Type;
    fn type_i32(&self) -> Self::Type;
    fn type_i64(&self) -> Self::Type;
    fn type_i128(&self) -> Self::Type;
    fn type_isize(&self) -> Self::Type;

    fn type_f32(&self) -> Self::Type;
    fn type_f64(&self) -> Self::Type;

    fn type_func(&self, args: &[Self::Type], ret: Self::Type) -> Self::Type;
    fn type_struct(&self, els: &[Self::Type], packed: bool) -> Self::Type;
    fn type_kind(&self, ty: Self::Type) -> TypeKind;
    fn type_ptr_to(&self, ty: Self::Type) -> Self::Type;
    fn type_ptr_to_ext(&self, ty: Self::Type, address_space: AddressSpace) -> Self::Type;
    fn element_type(&self, ty: Self::Type) -> Self::Type;

    /// Returns the number of elements in `self` if it is a LLVM vector type.
    fn vector_length(&self, ty: Self::Type) -> usize;

    fn float_width(&self, ty: Self::Type) -> usize;

    /// Retrieves the bit width of the integer type `self`.
    fn int_width(&self, ty: Self::Type) -> u64;

    fn val_ty(&self, v: Self::Value) -> Self::Type;
}

Every codegen implements this some way or another, you can find our implementation in ty.rs. Our implementation is pretty straightforward, LLVM has functions that we link to which get us the types we need:

impl<'ll, 'tcx> BaseTypeMethods<'tcx> for CodegenCx<'ll, 'tcx> {
    fn type_i1(&self) -> &'ll Type {
        unsafe { llvm::LLVMInt1TypeInContext(self.llcx) }
    }

    fn type_i8(&self) -> &'ll Type {
        unsafe { llvm::LLVMInt8TypeInContext(self.llcx) }
    }

    fn type_i16(&self) -> &'ll Type {
        unsafe { llvm::LLVMInt16TypeInContext(self.llcx) }
    }

    fn type_i32(&self) -> &'ll Type {
        unsafe { llvm::LLVMInt32TypeInContext(self.llcx) }
    }

    fn type_i64(&self) -> &'ll Type {
        unsafe { llvm::LLVMInt64TypeInContext(self.llcx) }
    }

    fn type_i128(&self) -> &'ll Type {
        unsafe { llvm::LLVMIntTypeInContext(self.llcx, 128) }
    }

    fn type_isize(&self) -> &'ll Type {
        self.isize_ty
    }

    fn type_f32(&self) -> &'ll Type {
        unsafe { llvm::LLVMFloatTypeInContext(self.llcx) }
    }

    fn type_f64(&self) -> &'ll Type {
        unsafe { llvm::LLVMDoubleTypeInContext(self.llcx) }
    }

    fn type_func(&self, args: &[&'ll Type], ret: &'ll Type) -> &'ll Type {
        unsafe { llvm::LLVMFunctionType(ret, args.as_ptr(), args.len() as c_uint, False) }
    }

// ...

There is also logic for handling ABI types, such as generating aggregate (struct) types

Keyboard shortcuts

GPU Computing with Rust using CUDA