Introduction

Welcome to the Rust CUDA Guide!

Goal

The Rust CUDA Project aims to make Rust a tier-1 language for GPU computing using the CUDA Toolkit. It provides tools for compiling Rust to fast PTX code, as well as libraries for using existing CUDA libraries from Rust.

Background

Historically, general-purpose high-performance GPU computing has been done with the CUDA Toolkit. CUDA primarily provides a way to write Fortran/C/C++ code that runs on the GPU in tandem with CPU code, all in a single source file. It also provides many libraries, tools, forums, and documentation to supplement that single-source CPU/GPU code.

CUDA is an NVIDIA-only toolkit. Many tools have been proposed for cross-platform GPU computing, such as OpenCL, Vulkan compute, and HIP. However, CUDA remains by far the most used toolkit for such tasks. This is why it is imperative to make Rust a viable option for use with the CUDA toolkit.

However, using CUDA with Rust has historically been a very rocky road. The only viable option until now has been the LLVM PTX backend, which does not always work and can generate invalid PTX for many common Rust operations. The advent of projects such as rust-gpu (for translating Rust to SPIR-V) has shown time and time again that a specialized solution is needed for Rust on the GPU.

Our hope is that this project pushes Rust on GPUs forward and makes Rust an excellent language for such tasks. Rust offers plenty of benefits, such as __restrict__-like performance benefits for every kernel (thanks to Rust's aliasing rules), an excellent module/crate system, delimiting of unsafe areas of CPU/GPU code with unsafe, high-level wrappers for low-level CUDA libraries, etc.

Structure

The scope of the Rust CUDA Project is broad, spanning the entirety of the CUDA ecosystem, with libraries and tools to make it usable using Rust. Therefore, the project contains many crates for all corners of the CUDA ecosystem.

  • rustc_codegen_nvvm is a rustc backend that targets NVVM IR (a subset of LLVM IR) for the libnvvm library.
    • Generates highly optimized PTX code which can be loaded by the CUDA Driver API to execute on the GPU.
    • For now it is CUDA-only, but it may be used to target AMD GPUs in the future.
  • cuda_std contains GPU-side functions and utilities, such as thread index queries, memory allocation, warp intrinsics, etc.
    • It is not a low-level library. It provides many utility functions to make it easier to write cleaner and more reliable GPU kernels.
    • It is closely tied to rustc_codegen_nvvm, which exposes GPU features through it internally.
  • cust contains CPU-side CUDA features such as launching GPU kernels, GPU memory allocation, device queries, etc.
    • It is a high-level wrapper for the CUDA Driver API, the lower level alternative to the more common CUDA Runtime API used from C++. It provides more fine-grained control over things like kernel concurrency and module loading than the Runtime API.
    • High-level Rust features such as RAII and Result make it easier and cleaner to manage the interface to the GPU.
  • cudnn is a collection of GPU-accelerated primitives for deep neural networks.
  • gpu_rand does GPU-friendly random number generation. It currently only implements xoroshiro RNGs from rand_xoshiro.
  • optix provides CPU-side hardware raytracing and denoising using the CUDA OptiX library. (This library is currently commented out because the OptiX SDK is difficult to install.)

There are also several “glue” crates for things such as high level wrappers for certain smaller CUDA libraries.

Other projects related to using Rust on the GPU:

  • 2016: glassful translates a subset of Rust to GLSL.
  • 2017: inspirv-rust is an experimental Rust-MIR-to-SPIR-V compiler.
  • 2018: nvptx is a Rust-to-PTX compiler using the nvptx target for rustc (using the LLVM PTX backend).
  • 2020: accel is a higher-level library that relied on the same mechanism that nvptx does.
  • 2020: rlsl is an experimental Rust-to-SPIR-V compiler (and a predecessor to rust-gpu).
  • 2020: rust-gpu is a rustc compiler backend to compile Rust to SPIR-V for use in shaders. Like Rust CUDA, it is part of the broader Rust GPU project.

Guide

This section covers some of the basics.

Getting started

Required libraries

Rust CUDA has several prerequisites.

  • A machine with an NVIDIA GPU with a Compute Capability of 5.0 (Maxwell) or later.
  • CUDA version 12.0 or later.
  • An appropriate NVIDIA driver.
    • For CUDA 12, the driver version N should be in the range 525 <= N < 580.
    • For CUDA 13, the driver version N should be in the range 580 <= N.
  • LLVM 7.x (7.0 to 7.4). This is (unfortunately) a very old version of LLVM. The codegen backend searches multiple places for LLVM.
    • If LLVM_CONFIG is present, the backend will use that path as llvm-config.
    • Or, if llvm-config is present as a binary, the backend will use that, assuming that llvm-config --version returns 7.x.x.
    • Failing that, the backend will attempt to download and use a prebuilt LLVM. This currently only works on Windows, however.

Because the required libraries can be difficult to install, we provide Docker images containing CUDA and LLVM 7. There are instructions on using these Docker images below. Alternatively, if you do want to install these libraries yourself, the steps within the Docker files are a good starting point.

CUDA basics

GPU kernels are functions launched from the CPU that run on the GPU. They do not have a return value, instead writing data into mutable buffers passed to them. CUDA executes many invocations of a GPU kernel at once (possibly hundreds), each one on a different thread, and each thread typically works on only part of the input and output buffers, sometimes just a single element thereof.

The caller decides the launch dimensions.

  • Threads: A single thread executes the GPU kernel once. CUDA makes the thread’s index available to the kernel.
  • Blocks: A single block houses multiple threads that it executes on its own. CUDA also makes the block’s index available to the kernel.

Block and thread dimensions may be 1D, 2D, or 3D. For example, you can launch 1 block of 6 threads, or 6x6 threads, or 6x6x6 threads. Likewise, you can launch 5 or 5x5 or 5x5x5 blocks. This can make index calculations for programs with 2D or 3D data simpler.
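
For intuition, here is a minimal sketch of what such a kernel might look like; the kernel name and buffer layout are purely illustrative, and it uses the same cuda_std helper (thread::index_1d()) that appears in the full example below.

use cuda_std::prelude::*;

#[kernel]
pub unsafe fn fill(out: *mut f32, len: usize) {
    // `index_1d()` is the globally-unique 1D index, i.e.
    // block index * threads-per-block + thread index.
    let i = thread::index_1d() as usize;
    if i < len {
        // Each invocation writes one distinct element of `out`.
        *out.add(i) = i as f32;
    }
}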

A first example: the code

This section will walk through a simple Rust CUDA program that adds two small 1D vectors on the GPU. It consists of two tiny crates and some connecting pieces.

The file structure looks like this:

.
├── rust-toolchain.toml  # Specifies which nightly version to use
├── build.rs             # Build script that compiles the code that runs on the GPU
├── kernels
│   ├── Cargo.toml       # Cargo manifest for code that runs on the GPU
│   └── src
│       └── lib.rs       # Code that runs on the GPU
├── Cargo.toml           # Cargo manifest for code that runs on the CPU
└── src
    └── main.rs          # Code that runs on the CPU

rust-toolchain.toml

rustc_codegen_nvvm currently requires a specific version of Rust nightly because it uses rustc internals that are subject to change. You must copy the appropriate revision of rust-toolchain.toml from the rust-cuda repository so that your own project uses the correct nightly version.
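
For illustration only, a rust-toolchain.toml has this general shape; the channel date and component list below are placeholders, and the real values must be copied from the rust-cuda repository:

[toolchain]
# Placeholder values: copy the actual channel and components from the
# rust-cuda repository's rust-toolchain.toml.
channel = "nightly-2025-01-01"
components = ["rust-src", "rustc-dev", "llvm-tools"]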

Cargo.toml and kernels/Cargo.toml

The top-level Cargo.toml looks like this:

[package]
name = "rust-cuda-basic"
version = "0.1.0"
edition = "2024"

[dependencies]
cust = { git = "https://github.com/rust-gpu/rust-cuda", rev = "7fa76f3d717038a92c90bf4a482b0b8dd3259344" }
kernels = { path = "kernels" }

[build-dependencies]
cuda_builder = { git = "https://github.com/rust-gpu/rust-cuda", rev = "7fa76f3d717038a92c90bf4a482b0b8dd3259344", features = ["rustc_codegen_nvvm"] }

kernels/Cargo.toml looks like this:

[package]
name = "kernels"
version = "0.1.0"
edition = "2024"

[dependencies]
cuda_std = { git = "https://github.com/rust-gpu/rust-cuda", rev = "7fa76f3d717038a92c90bf4a482b0b8dd3259344" }

[lib]
# - cdylib: because the nvptx targets do not support binary crate types.
# - rlib: so the `kernels` crate can be used as a dependency by `rust-cuda-basic`.
crate-type = ["cdylib", "rlib"]

At the time of writing there are no recent releases of any Rust CUDA crates so it is best to use code directly from the GitHub repository via git and rev. The above revision works but later revisions should also work.

kernels/src/lib.rs

This file defines the code that will run on the GPU.

use cuda_std::prelude::*;

// Input/output type shared with the `rust-cuda-basic` crate.
pub type T = f32;

#[kernel]
#[allow(improper_ctypes_definitions)]
pub unsafe fn add(a: &[T], b: &[T], c: *mut T) {
    let i = thread::index_1d() as usize;
    if i < a.len() {
        let elem = unsafe { &mut *c.add(i) };
        *elem = a[i] + b[i];
    }
}

It defines the addition of a single pair of elements in a and b. Some parts of this file look like normal Rust code, but some parts are unusual.

  • The type T will be shared with the CPU code in a way that minimizes the chances of certain kinds of errors. More on this below.
  • The #[kernel] attribute indicates this is code that runs on the GPU. It is similar to __global__ in CUDA C++. Multiple invocations of this kernel will run in parallel and share a, b, and c.
  • The proc macro that processes the #[kernel] attribute marks the kernel as no_mangle so that the name is obvious in both GPU code and CPU code. The proc macro also checks that the kernel is marked unsafe, all parameters are Copy, and there is no return value.
  • All GPU functions are unsafe because the parallel execution and sharing of data typical for GPU kernels is incompatible with safe Rust.
  • The inputs (a and b) are normal slices but the output (c) is a raw pointer. Again, this is because c is mutable state shared by multiple kernel invocations executing in parallel. Using &mut [T] would incorrectly indicate exclusive mutable state, and therefore Rust CUDA does not allow mutable references as arguments to kernels. Raw pointers do not have this restriction, so we use a pointer and only form a mutable reference once we have an element (c.add(i)) that we know won’t be touched by other kernel invocations.
  • The #[allow(improper_ctypes_definitions)] follows on from this. The kernel boundary is like an FFI boundary, and slices are not normally allowed there because they are not guaranteed to be passed in a particular way. However, rustc_codegen_nvvm does guarantee the way in which things like structs, slices, and arrays are passed (see Kernel ABI). Therefore this lint can be disabled.
  • thread::index_1d() gives the globally-unique thread index. The i < a.len() bounds check is necessary because threads run in blocks, and sometimes indices that exceed an input’s bounds occur.
  • The entire crate is compiled as no_std. If you want to use alloc, just add extern crate alloc; to the file.
  • The crate is actually compiled twice. Once by cuda_builder to produce PTX code for the kernels, and once normally by Cargo to produce the rlib for definitions (such as T) shared with the top-level crate.

Although this example only includes one kernel, larger examples contain multiple kernels, which is why the name kernels is used.

build.rs

The build script uses cuda_builder to compile the kernel to PTX code. Under the covers, cuda_builder uses rustc with rustc_codegen_nvvm. kernels.ptx will be embedded in the main executable.

use std::env;
use std::path;

use cuda_builder::CudaBuilder;

fn main() {
    println!("cargo::rerun-if-changed=build.rs");
    println!("cargo::rerun-if-changed=kernels");

    let out_dir = path::PathBuf::from(env::var("OUT_DIR").unwrap());
    let manifest_dir = path::PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap());

    // Compile the `kernels` crate to `$OUT_DIR/kernels.ptx`.
    CudaBuilder::new(manifest_dir.join("kernels"))
        .copy_to(out_dir.join("kernels.ptx"))
        .build()
        .unwrap();
}

You can specify a different compilation target by inserting an arch call in the method chain, e.g.:

        .arch(cuda_builder::NvvmArch::Compute90)  // Target compute capability 9.0

The compile target determines which GPU features are available. See the Compute Capability Gating guide for details on writing code that adapts to different GPU capabilities.

src/main.rs

The final file contains main, which ties everything together.

use cust::prelude::*;
use kernels::T;
use std::error::Error;

// Embed the PTX code as a static string.
static PTX: &str = include_str!(concat!(env!("OUT_DIR"), "/kernels.ptx"));

fn main() -> Result<(), Box<dyn Error>> {
    // Initialize the CUDA Driver API. `_ctx` must be kept alive until the end.
    let _ctx = cust::quick_init()?;

    // Create a module from the PTX code compiled by `cuda_builder`.
    let module = Module::from_ptx(PTX, &[])?;

    // Create a stream, which is like a thread for dispatching GPU calls.
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    // Initialize input and output buffers in CPU memory.
    let a: [T; _] = [1.0, 2.0, 3.0, 4.0];
    let b: [T; _] = [2.0, 3.0, 4.0, 5.0];
    let mut c: Vec<T> = vec![0.0 as T; a.len()];

    // Allocate memory on the GPU and copy the contents from the CPU memory.
    let a_gpu = a.as_dbuf()?;
    let b_gpu = b.as_dbuf()?;
    let c_gpu = c.as_slice().as_dbuf()?;

    // Launch the kernel on the GPU.
    // - The first two parameters between the triple angle brackets specify 1
    //   block of 4 threads.
    // - The third parameter is the number of bytes of dynamic shared memory.
    //   This is usually zero.
    // - These threads run in parallel, so each kernel invocation must modify
    //   separate parts of `c_gpu`. It is the kernel author's responsibility to
    //   ensure this.
    // - Immutable slices are passed via pointer/length pairs. This is unsafe
    //   because the kernel function is unsafe, but also because, like an FFI
    //   call, any mismatch between this call and the called kernel could
    //   result in incorrect behaviour or even uncontrolled crashes.
    let add_kernel = module.get_function("add")?;
    unsafe {
        launch!(
            add_kernel<<<1, 4, 0, stream>>>(
                a_gpu.as_device_ptr(),
                a_gpu.len(),
                b_gpu.as_device_ptr(),
                b_gpu.len(),
                c_gpu.as_device_ptr(),
            )
        )?;
    }

    // Wait for all work queued on the stream (i.e. all GPU threads) to complete before continuing.
    stream.synchronize()?;

    // Copy the GPU memory back to the CPU.
    c_gpu.copy_to(&mut c)?;

    println!("c = {:?}", c);

    Ok(())
}

Because T is shared between the crates, the type used in the buffers could be changed from f32 to f64 by modifying just the definition of T. Without that, such a change would require modifying lines in both crates, and any inconsistencies could cause correctness problems.
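
For example, switching the whole program to double precision would be a one-line change in kernels/src/lib.rs:

pub type T = f64;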

A first example: building and running

There are two ways to build and run this example: natively, and with Docker.

Native

If you have all the required libraries installed, try building with cargo build.

If you get an error “libnvvm.so.4: cannot open shared object file”, you will need to adjust LD_LIBRARY_PATH, something like this:

export LD_LIBRARY_PATH="/usr/local/cuda/nvvm/lib64:${LD_LIBRARY_PATH}"

If you get an error “error: couldn’t load codegen backend” on Windows, you will need to adjust PATH, something like this with CUDA 12:

$env:PATH += ";C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\nvvm\bin"

or this with CUDA 13:

$env:PATH += ";C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0\nvvm\bin\x64"

You should then be able to cargo run, and see the expected output:

c = [3.0, 5.0, 7.0, 9.0]

Docker

Docker is complicated. If you already know how it works, feel free to use the provided images however you like. The rest of this section aims to provide basic instructions for those who are less confident.

First, ensure you have Docker set up to use GPUs. Even with Docker, your machine will need an appropriate driver.

You can build your own Docker image, but it is easier to use a prebuilt one. The dcr script uses docker create with a prebuilt image to create a container that contains the required libraries. It then uses docker start to start the container in such a way that it will run indefinitely unless explicitly stopped. Even if the host machine is rebooted, the container will automatically restart.

Once the container is started, the dex script uses docker exec to run arbitrary commands within the container. For example, dex cargo build will execute cargo build within the container.

Some useful docker commands:

  • docker exec -it rust-cuda bash: run a bash shell within the container. This lets you operate inside the container indefinitely. But facilities within the container are limited, so using dex to run commands one at a time is generally easier.
  • docker images: show the status of all local images.
  • docker ps: show the status of running containers.
  • docker ps --all: show the status of all containers.
  • docker stop rust-cuda: stop the rust-cuda container.
  • docker rm rust-cuda: remove the rust-cuda container, which must have been stopped.

If you have problems with the container, the following steps may help with checking that your GPU is recognized.

  • Check if dex nvidia-smi provides meaningful output.
  • NVIDIA provides a number of samples. You could try building and running the deviceQuery sample. If all is well, it will print various details about your GPU.

A sample .devcontainer.json file is also included, configured for Ubuntu 24.04. Copy this to .devcontainer/devcontainer.json to make additional customizations.

More examples

The examples directory has more complex examples. They all follow the same basic structure as this first example.

Compute capability gating

This section covers how to write code that adapts to different CUDA compute capabilities using conditional compilation.

What are compute capabilities?

CUDA GPUs have different “compute capabilities” that determine which features they support. Each capability is identified by a version number like 3.5, 5.0, 6.1, 7.5, etc. Higher numbers generally mean more features are available.

For example:

  • Compute capability 5.0+ supports 64-bit integer min/max and bitwise atomic operations
  • Compute capability 6.0+ supports double-precision (f64) atomic operations
  • Compute capability 7.0+ supports tensor core operations

For comprehensive details, see NVIDIA’s CUDA documentation on GPU architectures.

Virtual vs. real architectures

In CUDA terminology:

  • Virtual architectures (compute_XX) define the PTX instruction set and available features
  • Real architectures (sm_XX) represent actual GPU hardware

Rust CUDA works exclusively with virtual architectures since it only generates PTX. The NvvmArch::ComputeXX enum values correspond to CUDA’s virtual architectures.

Using target features

When building your kernel, the NvvmArch::ComputeXX variant you choose enables specific target_feature flags. These can be used with #[cfg(...)] to conditionally compile code based on the capabilities of the target GPU.

For example, this checks whether the target architecture supports running compute 6.0 code or newer:

#[cfg(target_feature = "compute_60")]

Think of it as asking: “Is the GPU I’m building for at least compute 6.0?” Depending on which NvvmArch::ComputeXX is used to build the kernel, there is a different answer:

  • Building for Compute60 → ✓ Yes (exact match)
  • Building for Compute70 → ✓ Yes (7.0 GPUs support 6.0 code)
  • Building for Compute50 → ✗ No (5.0 GPUs can’t run 6.0 code)

These features let you write optimized code paths for specific GPU generations while still supporting older ones.

Specifying compute capabilities

Starting with CUDA 12.9, NVIDIA introduced architecture suffixes that affect compatibility.

Base architecture (no suffix)

Example: NvvmArch::Compute70

This is everything mentioned above, and was the only option in CUDA 12.8 and lower.

When to use: Default choice for maximum compatibility.

Example usage:

// In build.rs
CudaBuilder::new("kernels")
    .arch(NvvmArch::Compute70)
    .build()
    .unwrap();

// In your kernel code:
#[cfg(target_feature = "compute_60")]  // ✓ Pass (lower base variant)
#[cfg(target_feature = "compute_70")]  // ✓ Pass (this base variant)
#[cfg(target_feature = "compute_80")]  // ✗ Fail (higher base variant)

Family suffix (‘f’)

Example: NvvmArch::Compute101f

Specifies code compatible with the same major compute capability version and with an equal or higher minor compute capability version.

When to use: When you need features from a specific minor version but want forward compatibility within the family.

Example usage:

// In build.rs
CudaBuilder::new("kernels")
    .arch(NvvmArch::Compute101f)
    .build()
    .unwrap();

// In your kernel code:
#[cfg(target_feature = "compute_90")]    // ✓ Pass (lower base variant)
#[cfg(target_feature = "compute_100")]   // ✓ Pass (lower base variant)
#[cfg(target_feature = "compute_100f")]  // ✓ Pass (lower 'f' variant)
#[cfg(target_feature = "compute_101")]   // ✓ Pass (this base variant)
#[cfg(target_feature = "compute_101f")]  // ✓ Pass (the 'f' variant itself)
#[cfg(target_feature = "compute_103")]   // ✗ Fail (higher base variant)
#[cfg(target_feature = "compute_110")]   // ✗ Fail (higher base variant)

Architecture suffix (‘a’)

Example: NvvmArch::Compute100a

Specifies code that only runs on GPUs of that specific compute capability and no others. However, during compilation, it enables all available instructions for the architecture, including all base variants up to the same version and all family variants with the same major version and equal or lower minor version.

When to use: When you need to use architecture-specific features (like certain Tensor Core operations) that are only available on that exact GPU model.

Example usage:

// In build.rs
CudaBuilder::new("kernels")
    .arch(NvvmArch::Compute100a)
    .build()
    .unwrap();

// In your kernel code:
#[cfg(target_feature = "compute_90")]    // ✓ Pass (lower base variant)
#[cfg(target_feature = "compute_100")]   // ✓ Pass (base variant)
#[cfg(target_feature = "compute_100f")]  // ✓ Pass (family variant with same major/minor)
#[cfg(target_feature = "compute_100a")]  // ✓ Pass (the 'a' variant itself)
#[cfg(target_feature = "compute_101f")]  // ✗ Fail (higher family variant)
#[cfg(target_feature = "compute_110")]   // ✗ Fail (higher base variant)

Note: While the ‘a’ variant enables all these features during compilation (allowing you to use all available instructions), the generated PTX code will still only run on the exact GPU architecture specified.

For more details on suffixes, see NVIDIA’s blog post on family-specific architecture features.

Manual compilation (without cuda_builder)

If you’re invoking rustc directly instead of using cuda_builder, you only need to specify the architecture through LLVM args:

rustc --target nvptx64-nvidia-cuda \
    -C llvm-args=-arch=compute_61 \
    -Z codegen-backend=/path/to/librustc_codegen_nvvm.so \
    ...

Or with cargo:

export RUSTFLAGS="-C llvm-args=-arch=compute_61 -Z codegen-backend=/path/to/librustc_codegen_nvvm.so"
cargo build --target nvptx64-nvidia-cuda

The codegen backend automatically synthesizes target features based on the architecture type as described above.

Common patterns for base architectures

These patterns work when using base architectures (no suffix), which enable all lower capabilities:

At least a capability (default)

// Code that requires compute 6.0 or higher
#[cfg(target_feature = "compute_60")]
{
    cuda_std::atomic::atomic_add(data, 1.0); // f64 atomics need 6.0+
}

Exactly one capability

// Code that targets exactly compute 6.1 (not 6.2+)
#[cfg(all(target_feature = "compute_61", not(target_feature = "compute_62")))]
{
    // Features specific to compute 6.1
}

Up to a maximum capability

// Code that works up to compute 6.0 (not 6.1+)
#[cfg(not(target_feature = "compute_61"))]
{
    // Maximum compatibility implementation
}

Targeting specific architecture ranges

// This block compiles when building for architectures >= 6.0 but < 8.0
#[cfg(all(target_feature = "compute_60", not(target_feature = "compute_80")))]
{
    // Code here can use features from 6.0+ but must not use 8.0+ features
}

Debugging capability issues

If you encounter errors about missing functions or features:

  1. Check the compute capability you’re targeting in cuda_builder
  2. Verify your GPU supports the features you’re using
  3. Use nvidia-smi to check your GPU’s compute capability
  4. Add appropriate #[cfg] guards or increase the target architecture

Runtime behavior

Again, Rust CUDA only generates PTX, not pre-compiled GPU binaries (“fatbinaries”). This PTX is then JIT-compiled by the CUDA driver at runtime.

For more details, see NVIDIA’s documentation on GPU compilation and JIT compilation.

Tips

This section contains some tips on what to do and what not to do using the project.

GPU kernels

  • Generally, don’t derive Debug for structs in GPU crates. The codegen backend currently does not do much global DCE (dead code elimination), so Debug impls can really slow down compile times and make the PTX gigantic. This will get much better in the future, but currently it causes some undesirable effects.

  • Don’t use recursion. CUDA allows it, but threads have very limited stacks (local memory), and stack overflows yield confusing InvalidAddress errors. If you are getting such an error, run the executable in cuda-memcheck; it should report a write failure to Local memory at an address of about 16 MB. You can also put the PTX file through cuobjdump, which should yield ptxas warnings for functions without a statically known stack usage.

Kernel ABI

This section details how parameters are passed to GPU kernels by the codegen backend. In other words, how the codegen backend expects you to pass different types to GPU kernels from the CPU.

⚠️ If you find any bugs in the ABI please report them. ⚠️

Preface

Please note that the following only applies to non-Rust call conventions; we make zero guarantees about the Rust call convention, just like rustc.

While we currently override every ABI except Rust, you should generally only use "C"; any other ABI we override purely to avoid footguns.

The #[kernel] macro enforces that functions marked as #[kernel] are extern "C", and it is expected that all GPU kernels be extern "C". That is not to say you should declare any kernels without the #[kernel] macro; the codegen backend and cuda_std are allowed to rely on the behavior of #[kernel] for correctness.

Structs

Structs are always passed directly as byte arrays when they are passed by value to the kernel. This corresponds to what is expected by CUDA/the PTX ABI.

For example:

#[derive(Clone, Copy)]
#[repr(C)]
pub struct Foo {
    pub a: u16,
    pub b: u64,
    pub c: u128,
}

#[kernel]
pub unsafe fn kernel(a: Foo) {
    /* ... */
}

will map to the following PTX:

.visible .entry kernel(
	.param .align 16 .b8 kernel_param_0[32]
)

Consequently, it is expected that you will pass the struct by value when launching the kernel, and not by reference (by allocating a device box):

let foo = Foo { 
    a: 5,
    b: 6,
    c: 7
};

unsafe {
    launch!(
        module.kernel<<<1, 1, 0, stream>>>(foo)
    )?;
}

And not

let foo = DeviceBox::new(Foo { 
    a: 5,
    b: 6,
    c: 7
});

unsafe {
    launch!(
        module.kernel<<<1, 1, 0, stream>>>(foo.as_device_ptr())
    )?;
}

Arrays

Like structs, arrays are always passed by value as byte arrays.

Slices

Slices are passed as two word-sized parameters: a pointer to the beginning of the data, and an integer giving the length of the slice.

For example:

#[kernel]
pub unsafe fn kernel(a: &[u8]) {
  /* ... */
}

Will map to the following PTX (on nvptx64):

.visible .entry kernel(
	.param .u64 kernel_param_0,
	.param .u64 kernel_param_1
)

Consequently, it is expected that you will pass the pointer and the length as multiple parameters when calling the kernel:

let mut buf = [5u8; 10].as_dbuf()?;

unsafe {
  launch!(
    module.kernel<<<1, 1, 0, stream>>>(buf.as_device_ptr(), buf.len())
  )?;
}

You may get warnings about slices being an improper C type, but the warnings are safe to ignore; the codegen backend guarantees that slices are passed as pairs of parameters.

You cannot, however, pass mutable slices: each thread would receive a copy of the same mutable slice, which would violate aliasing rules. You may use a &[UnsafeCell<T>] and convert an element to a mutable reference (once you know the element accesses are disjoint), or, more commonly, use a raw pointer.
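
A rough sketch of the &[UnsafeCell<T>] pattern (the kernel name and parameters here are illustrative, not part of the example project):

use core::cell::UnsafeCell;
use cuda_std::prelude::*;

#[kernel]
#[allow(improper_ctypes_definitions)]
pub unsafe fn scale(out: &[UnsafeCell<f32>], factor: f32) {
    let i = thread::index_1d() as usize;
    if i < out.len() {
        // Each thread touches a distinct element, so forming a mutable
        // reference to just that element is sound.
        let elem = unsafe { &mut *out[i].get() };
        *elem *= factor;
    }
}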

ZSTs

ZSTs (zero-sized types) are ignored and become nothing in the final PTX.

Primitives

Primitive types are passed directly by value, same as structs. They map to the PTX types .s8, .s16, .s32, .s64, .u8, .u16, .u32, .u64, .f32, and .f64, with the exception that u128 and i128 are passed as byte arrays (this has no impact on how they are passed from the CPU).

References and pointers

References and pointers are both passed as expected, as pointers. It is therefore expected that you pass such parameters using device memory:

#[kernel]
pub unsafe fn kernel(a: &u8) {
  /* ... */
}

let mut val = DeviceBox::new(&5)?;

unsafe {
  launch!(
    module.kernel<<<1, 1, 0, stream>>>(val.as_device_ptr())
  )?;
}

repr(Rust) Types

Using repr(Rust) types inside of kernels is not disallowed, but it is highly discouraged. This is because rustc is allowed to change how the types are represented across compiler invocations, which leads to hard-to-track errors.

Therefore, you should generally only use repr(C) types in kernel parameters, with the exception of slices, which have a guaranteed parameter layout.

Safety

With memory safety being one of Rust’s main focuses, we strive to make most things safe without requiring too much unsafe usage and mental checking from the user. However, CUDA’s inherent thread/memory model leaves many things ambiguous as to whether they are sound, and makes many invariants inherently impossible to statically prove. In this section we will talk about what kinds of behavior are considered undefined inside of kernels, as well as the invariants that must be upheld by the caller of kernels.

⚠️ This list is not fully complete, as the semantics of Rust safety on the GPU have not been explored much; however, it includes most of the actions “regular” users may commit. ⚠️

Behavior considered undefined in GPU kernels

This list is purely meant to be a guide on what behavior should be avoided.

Undefined behavior on the GPU can potentially cause the following (but is not limited to):

  • Unknown/Undefined data being written to a location in memory.
  • Causing fatal termination of either just the kernel (through trapping), or the entire CUDA driver (through invalid address errors).
  • Causing LLVM/NVVM to optimize the code into unknown code.

Behavior considered undefined inside of GPU kernels:

  • Most importantly, any behavior that is considered undefined on the CPU is considered undefined on the GPU too. The only exception is invalid sizes for buffers given to a GPU kernel.

Currently, we declare that it is up to the caller of the kernel to uphold the invariant that a buffer given to a GPU kernel is large enough for any access the kernel will make. This idiom may be changed in the future.

  • Any kind of data race. This has the same semantics as data races in CPU code, such as:
    • Multiple threads writing to a location in memory at the same time without synchronization.
    • One or more threads reading while a thread is writing to a memory location.
    • Reading shared memory while a thread is writing to the location (if for example thread::sync has not been called).

Behavior not currently considered undefined, but considered undesirable:

  • Calling thread::sync inside of a branch that not all threads in the thread block have reached.

Behavior considered undefined on the CPU

This list will contain behavior that is considered undefined in the context of actually launching GPU kernels from the CPU.

Streams

Kernels launched inside of a single stream are guaranteed to be executed one after the other, in order.

However, kernels launched in different streams have no guarantee of execution order; their execution may be interleaved, and kernels are likely to run concurrently on the GPU.

Therefore, it is undefined behavior to write to the same memory location in kernels executed in different streams without synchronization.

For example:

  1. Foo is allocated as a buffer of memory on the GPU.
  2. Stream 1 launches kernel bar which writes to Foo.
  3. Stream 2 launches kernel bar which also writes to Foo.

This is undefined behavior because the kernels are likely to be executed concurrently, causing a data race when multiple kernels try to write to the same memory.

However, if the CPU thread that launches on Stream 2 first calls synchronize() on Stream 1, this is sound, because synchronize() waits for Stream 1 to finish all of its tasks before giving control back to the calling thread.
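
A rough sketch of that ordering, assuming foo is a device buffer and that both streams were created with Stream::new as in the earlier examples:

unsafe {
    launch!(module.bar<<<1, 1, 0, stream_1>>>(foo.as_device_ptr()))?;
}
// Wait for everything queued on stream 1 (including `bar`) to finish...
stream_1.synchronize()?;
// ...before queueing another kernel on stream 2 that writes to the same buffer.
unsafe {
    launch!(module.bar<<<1, 1, 0, stream_2>>>(foo.as_device_ptr()))?;
}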

Another important detail is that GPU operations on a stream are NOT synchronized with the CPU. This means that CPU code may not rely on a kernel being finished without calling synchronize(). For example:

launch!(module.bar<<<1, 1, 0, stream>>>(foo.as_unified_ptr()))?;
// 'bar' is not guaranteed to be finished executing at this point.
function_that_accesses_foo(foo);
stream.synchronize()?;
// foo may be accessed and will see the changes that 'bar' wrote to 'foo'. 'bar' is guaranteed 
// to be finished executing.

Contexts

Contexts are akin to CPU processes, therefore, it is undefined behavior (although it should always yield an invalid address error) to access another context’s allocated GPU memory.

However, this is very uncommon, because single-device code generally should not need multiple contexts. This only becomes relevant when using multiple devices (multi-GPU code) with different contexts.

Note, however, that unified memory can be accessed by multiple GPUs and multiple contexts at the same time, as unified memory takes care of copying and moving data automatically between GPUs and the CPU when a page fault occurs. For this reason, as well as general ease of use, we suggest using unified memory over regular device memory in most cases.
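
A minimal sketch of the unified memory flow, assuming cust exposes UnifiedBuffer with a from_slice constructor and slice-like indexing (check the cust documentation for the exact API):

use cust::memory::UnifiedBuffer;

// Allocated once; readable and writable from the CPU and from kernels on any
// context, with the driver migrating pages on demand.
let mut foo = UnifiedBuffer::from_slice(&[0.0f32; 16])?;
foo[0] = 1.0; // direct CPU write, no explicit copy
// ... launch kernels that take `foo.as_unified_ptr()` ...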

Kernel launches

Kernel launches are the most unsafe part of CUDA; many things must be checked by the developer to soundly launch a kernel. It is fundamentally impossible for us to verify a large portion of the invariants expected by the kernel/CUDA.

The following invariants must be upheld by the caller of a kernel; failure to do so is undefined behavior:

  • The number of parameters passed to the kernel must match the expected number of parameters.
  • The dimensionality expected by the kernel must match, e.g. if the kernel expects 2d thread indices, it is undefined behavior to launch the kernel with 3d thread indices (which would cause a data race). However, it is not undefined behavior to launch the kernel with a dimensionality lower than expected, e.g. launching a 2d kernel with a 1d dimensionality.
  • The types expected by the kernel must match:
    • If the kernel expects a struct: if the struct is repr(Rust), it must be the actual struct from the kernel library; otherwise, if it is repr(C) (which is recommended), the fields must all match, including alignment and order of fields.
  • Reference aliasing rules must not be violated, including:
    • Immutable references are allowed to be aliased, e.g. if a kernel expects &T and &T, it is sound to pass the same pointer for both.
    • Data behind an immutable reference must not be modified, meaning, it is undefined behavior to pass the same pointer to &T and *mut T, where *mut T is used for modifying the data.
    • Parameters such as &[UnsafeCell<T>] must be exclusive, assuming the kernel uses the UnsafeCell<T> to modify the data.
    • *mut T does not necessarily need to follow aliasing rules; it is sound to pass the same pointer to two *mut T parameters, assuming that the kernel accesses nonoverlapping regions of the memory. If a mutable reference is formed from the pointer, the mutable reference must be exclusive; e.g. it is undefined behavior for two threads to create a mutable reference to the same element behind the pointer.
  • Any buffers passed to the kernel must be large enough for the size that the kernel expects. Allocated buffer size being correct for what the kernel expects is up to the caller, not the kernel.
  • Enough dynamic shared memory must be allocated for how much the kernel expects.

Behavior that is not considered undefined but is undesirable:

  • Launching a kernel with more threads than expected by its launch bounds (.maxntid in PTX). This will cause the launch to fail.
  • Launching a kernel with a different number of threads than expected by its launch bounds (.reqntid in PTX). This will also cause the launch to fail.

The CUDA Toolkit

The CUDA Toolkit is an ecosystem for executing extremely fast code on NVIDIA GPUs for the purpose of general computing.

CUDA includes many libraries for this purpose, including the Driver API, Runtime API, the PTX ISA, libNVVM, etc. CUDA is currently the best option for computing in terms of available libraries and control; however, it unfortunately only works on NVIDIA GPUs.

This section will cover some of the general uses of GPU computing, why CUDA is worth using, and general CUDA principles. This section will not cover everything about CUDA, and it is not meant to. You can check out the official CUDA guide for a complete overview.

GPU computing

You probably already know what GPU computing is, but if you don’t, it is utilizing the extremely parallel nature of GPUs for purposes other than rendering. It is widely used in many scientific and consumer fields; some of the most common uses are fluid/smoke simulation, protein folding, physically based rendering, cryptocurrency mining, AI model training, etc.

GPUs excel at tasks that do mostly the same thing every time, and need to do it millions of times. They do not excel at so-called “divergent” tasks, tasks where each run of the task may take different amounts of time and/or take different code paths.

Why CUDA?

CUDA is currently one of the best choices for fast GPU computing for multiple reasons:

  • It offers deep control over how kernels are dispatched and how memory is managed.
  • It has a rich ecosystem of tutorials, guides, and libraries such as cuRAND, cuBLAS, libNVVM, OptiX, the PTX ISA, etc.
  • It is mostly unmatched in performance because it is solely meant for computing and offers rich control. And more…

However, CUDA can only run on NVIDIA GPUs, which precludes AMD GPUs from tools that use it. This is a drawback many find acceptable because of the significant developer cost of supporting both NVIDIA GPUs with CUDA and AMD GPUs with OpenCL, since OpenCL is generally slower, clunkier, and lacks libraries and docs on par with CUDA.

Why Rust?

Rust is a great choice for GPU programming; however, it has needed a kickstart, which is what rustc_codegen_nvvm tries to provide. The initial hurdle of getting Rust to compile to something CUDA can run is over; now comes the design and polish part.

On top of its rich language features (macros, enums, traits, proc macros, great errors, etc.), Rust’s safety guarantees can be applied to GPU programming too, a field that has historically been full of implied invariants and unsafety, such as (but not limited to):

  • Expecting some amount of dynamic shared memory from the caller.
  • Expecting a certain layout for thread blocks/threads.
  • Manually handling the indexing of data, leaving code prone to data races if not managed correctly.
  • Forgetting to free memory, using uninitialized memory, etc.

Not to mention the standardized tooling that makes building, documenting, sharing, and linting GPU kernel libraries easy. Most of the reasons for using Rust on the CPU apply to using Rust on the GPU; these reasons have been stated countless times, so I will not repeat them here.

A couple of Rust features in particular make writing CUDA code much easier: RAII and Results. In cust, everything uses RAII (through Drop impls) to manage freeing memory and returning handles, which frees users from having to think about it and yields safer, more reliable code.

Results are particularly helpful; almost every call in every CUDA library returns a status code in the form of a CUDA result. Ignoring these statuses is very dangerous and can often lead to random segfaults and overall unreliable code. For this purpose, both the CUDA SDK and other libraries provide macros to handle such statuses. This handling is not very reliable and causes dependency issues down the line.

Instead of an unreliable system of macros, we can leverage Rust Results. In cust we return special CudaResult<T> results that can be bubbled up using Rust’s ? operator, or, similar to CUDA_SAFE_CALL, can be unwrapped or expected if proper error handling is not needed.
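
A minimal sketch of what this looks like in practice, assuming a context has already been created (e.g. via cust::quick_init()):

use cust::error::CudaResult;
use cust::memory::DeviceBuffer;

fn upload(data: &[f32]) -> CudaResult<DeviceBuffer<f32>> {
    // Any failing CUDA call is bubbled up to the caller with `?` instead of
    // being checked through status-code macros; dropping the returned buffer
    // later frees the GPU allocation through its `Drop` impl (RAII).
    DeviceBuffer::from_slice(data)
}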

The CUDA pipeline

CUDA is traditionally used via CUDA C/C++ files which have a .cu extension. These files can be compiled using NVCC (NVIDIA CUDA Compiler) into an executable.

CUDA files consist of device and host functions. Device functions run on the GPU and are also called kernels. Host functions run on the CPU and usually include logic for allocating GPU memory and calling device functions.

Behind the scenes, NVCC has several stages of compilation.

First, NVCC separates device and host functions and compiles them separately. Device functions are compiled to NVVM IR, a subset of LLVM IR with additional restrictions, including the following:

  • Many intrinsics are unsupported.
  • “Irregular” integer types such as i4 or i111 are unsupported and will segfault (however in theory they should be supported).
  • Global names cannot include the “.” character.
  • Some linkage types are not supported.
  • Function ABIs are ignored; everything uses the PTX calling convention.

libNVVM is a closed-source library which takes NVVM IR, optimizes it further, then converts it to PTX. PTX is a low-level, assembly-like format with an open specification which can be targeted by any language. For an assembly format, PTX is fairly user-friendly:

  • It is well formatted.
  • It is mostly fully specified (other than the iffy grammar specification).
  • It uses named registers/parameters.
  • It uses virtual registers. (Because GPUs have thousands of registers, listing all of them out would be unrealistic.)
  • It uses ASCII as a file encoding.

PTX can be run on NVIDIA GPUs using the Driver API or Runtime API. Those APIs convert the PTX into a final format called SASS, which is register-allocated and executed on the GPU.

rustc_codegen_nvvm

This section will cover the more technical details of how rustc_codegen_nvvm works as well as the issues that came with it.

It will also explain some technical details about CUDA/PTX/etc.; these are not necessarily limited to rustc_codegen_nvvm.

Basic knowledge of how rustc and LLVM work and what they do is assumed. You can find info about rustc in the rustc dev guide.

Custom rustc backends

Before we get into the details of rustc_codegen_nvvm, we obviously need to explain what a codegen backend is!

Custom codegen backends are rustc’s answer to “well, what if I want Rust to compile to X?”. This is a problem that comes up in many situations, especially in conversations of “well, LLVM cannot target this, so we are screwed”. To solve this problem, rustc decided to incrementally decouple itself from relying exclusively on LLVM.

Previously, rustc only had a single codegen backend, the LLVM codegen backend, which translated MIR directly to LLVM IR. This is great if you just want to support LLVM, but LLVM is not perfect, and inevitably you will hit limits to what LLVM is able to do. Or you may just want to stop using LLVM; LLVM is not without problems (it is often slow, clunky to deal with, and does not support a lot of targets).

Nowadays, rustc is almost fully decoupled from LLVM and is instead generic over the codegen backend used. rustc uses a system of codegen backends that implement traits and then get loaded as dynamically linked libraries. This allows Rust to compile to virtually anything with a surprisingly small amount of work. At the time of writing, there are five publicly known codegen backends:

  • rustc_codegen_cranelift
  • rustc_codegen_llvm
  • rustc_codegen_gcc
  • rustc_codegen_spirv
  • rustc_codegen_nvvm, obviously the best codegen ;)

rustc_codegen_cranelift targets Cranelift, a codegen backend written in Rust that compiles faster than LLVM but does not have as many optimizations. rustc_codegen_llvm is obvious; it is the backend almost everybody uses, which targets LLVM. rustc_codegen_gcc targets GCC (GNU Compiler Collection), which is able to target more exotic targets than LLVM, especially for embedded. rustc_codegen_spirv targets the SPIR-V (Standard Portable Intermediate Representation 5) format, which is mostly used for compiling shader languages such as GLSL or WGSL to a standard representation that Vulkan/OpenGL can use; the reasons why SPIR-V is not an alternative to CUDA/rustc_codegen_nvvm have been covered in the FAQ.

Finally, we come to the star of the show, rustc_codegen_nvvm. This backend targets NVVM IR for compiling Rust to GPU kernels that can be run by CUDA. What NVVM IR/libNVVM are has been covered in the CUDA section.

rustc_codegen_ssa

rustc_codegen_ssa is the central crate behind every single codegen backend and does much of the hard work. It abstracts away the MIR lowering logic so that custom codegen backends only have to implement some traits and the SSA codegen does everything else. For example:

  • A trait for getting a type like an integer type.
  • A trait for optimizing a module.
  • A trait for linking everything.
  • A trait for declaring a function.

And so on. You will find an SSA codegen trait in almost every file.

rustc_codegen_nvvm

At the highest level, our codegen workflow goes like this:

Source code -> Typechecking -> MIR -> SSA Codegen -> LLVM IR (NVVM IR) -> PTX -> PTX opts/function DCE -> Final PTX
               |                                     |                  |      |                                  ^
               |                                     |          libNVVM +------+                                  |
               |                                     |                                                            |
               |                  rustc_codegen_nvvm +------------------------------------------------------------|
         rustc +---------------------------------------------------------------------------------------------------

Before we do anything, rustc does its normal job: it typechecks, converts everything to MIR, etc. Then, rustc loads our codegen backend shared lib and invokes it to codegen the MIR. It creates an instance of NvvmCodegenBackend and invokes codegen_crate. You could do anything inside codegen_crate, but we just defer back to rustc_codegen_ssa and tell it to do the job for us:

fn codegen_crate<'tcx>(
    &self,
    tcx: TyCtxt<'tcx>,
    metadata: EncodedMetadata,
    need_metadata_module: bool,
) -> Box<dyn std::any::Any> {
    Box::new(rustc_codegen_ssa::base::codegen_crate(
        NvvmCodegenBackend,
        tcx,
        String::new(),
        metadata,
        need_metadata_module,
    ))
}

After that, the codegen logic is kind of abstracted away from us, which is a good thing! We just need to provide the SSA codegen crate whatever it needs to do its thing. This is done in the form of traits, lots and lots and lots of traits, more traits than you’ve ever seen, traits your subconscious has warned you of in nightmares. Anyway, because talking about how the SSA codegen crate works is kind of useless, we will instead talk first about general concepts and terminology, then dive into each trait.

But first, let’s talk about the end of the codegen. It is pretty simple; after codegen is done and LLVM has been run to optimize each module, we do a couple of things:

  1. We gather every LLVM bitcode module we created.
  2. We create a new libNVVM program.
  3. We add every bitcode module to the libNVVM program.
  4. We try to find libdevice and add it to the program (see nvidia docs on what libdevice is).
  5. We run the verifier on the NVVM program just to check that we did not create any invalid NVVM IR.
  6. We run the compiler which gives us a final PTX string, hooray!
  7. Finally, the PTX goes through a small stage where it’s parsed and function DCE is run to eliminate most of the bloat in the file. Traditionally this is done by the linker, but there’s no linker to be found for miles here.
  8. We write this PTX file to wherever rustc tells us to write the final file.

We will cover the libNVVM steps in more detail later on.

Codegen units (CGUs)

Ah, codegen units, the thing everyone just tells you to set to 1 in Cargo.toml, but what are they? To put it simply, codegen units are rustc splitting up a crate into different modules that it can then run LLVM over in parallel. For example, rustc can run LLVM over two different modules in parallel and save time.
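
For reference, the Cargo.toml knob that sentence alludes to looks like this:

[profile.release]
codegen-units = 1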

This gets a little bit more complex with generics. Because MIR is not monomorphized (monomorphized MIR is not a thing; the compiler monomorphizes instances on the fly), rustc needs to put any generic functions that a CGU relies on inside of the same CGU so it can monomorphize them.

Rlibs

Rlibs are mysterious files; their origins are mysterious and their contents are the deepest layer of the iceberg. Just kidding, but rlibs often confuse people (including me at first). Rlibs are rustc’s way of encoding basically everything it needs to know about a crate into a file. Rlibs usually contain the following:

  • Object files for each CGU.
  • LLVM bitcode.
  • A symbol table.
  • Metadata:
    • The rustc version (because things can go kaboom if version mismatches, ABIs are fun amirite)
    • A crate hash
    • A crate id
    • Info about the source files
    • The exported API, things like macros, traits, etc.
    • MIR, for things such as generic functions and #[inline]d functions (please don’t put #[inline] on everything, rustc will cry)

Types

Types! Who doesn’t love types, especially those that cause libNVVM to randomly segfault or loop forever? Anyways, types are an integral part of the codegen backend; everything revolves around them and you will see them everywhere.

rustc_codegen_ssa does not actually tell you what your type representation should be; it lets you decide. For example, Rust GPU represents it as a SpirvType enum, while both rustc_codegen_llvm and our codegen represent it as opaque LLVM types:

type Type = &'ll llvm::Type;

llvm::Type is an opaque type that comes from llvm-c. 'll is one of the main lifetimes you will see throughout the whole codegen; it is used for anything that lasts as long as the current usage of LLVM. LLVM gives you back pointers when you ask for a type or value; some time ago rustc_codegen_llvm fully switched to using references over pointers, and we follow in their footsteps.

One important fact about types is that they are opaque. You cannot take a type and ask “is this X struct?”; this is like asking “which chickens were responsible for this omelette?”. You can ask if it’s a number type, a vector type, a void type, etc.

The SSA codegen crate needs to ask the backend for types for everything it needs to codegen MIR. It does this using a trait called BaseTypeMethods:

pub trait BaseTypeMethods<'tcx>: Backend<'tcx> {
    fn type_i1(&self) -> Self::Type;
    fn type_i8(&self) -> Self::Type;
    fn type_i16(&self) -> Self::Type;
    fn type_i32(&self) -> Self::Type;
    fn type_i64(&self) -> Self::Type;
    fn type_i128(&self) -> Self::Type;
    fn type_isize(&self) -> Self::Type;

    fn type_f32(&self) -> Self::Type;
    fn type_f64(&self) -> Self::Type;

    fn type_func(&self, args: &[Self::Type], ret: Self::Type) -> Self::Type;
    fn type_struct(&self, els: &[Self::Type], packed: bool) -> Self::Type;
    fn type_kind(&self, ty: Self::Type) -> TypeKind;
    fn type_ptr_to(&self, ty: Self::Type) -> Self::Type;
    fn type_ptr_to_ext(&self, ty: Self::Type, address_space: AddressSpace) -> Self::Type;
    fn element_type(&self, ty: Self::Type) -> Self::Type;

    /// Returns the number of elements in `self` if it is a LLVM vector type.
    fn vector_length(&self, ty: Self::Type) -> usize;

    fn float_width(&self, ty: Self::Type) -> usize;

    /// Retrieves the bit width of the integer type `self`.
    fn int_width(&self, ty: Self::Type) -> u64;

    fn val_ty(&self, v: Self::Value) -> Self::Type;
}

Every codegen backend implements this one way or another; you can find our implementation in ty.rs. Our implementation is pretty straightforward, since LLVM has functions that we link to which get us the types we need:

impl<'ll, 'tcx> BaseTypeMethods<'tcx> for CodegenCx<'ll, 'tcx> {
    fn type_i1(&self) -> &'ll Type {
        unsafe { llvm::LLVMInt1TypeInContext(self.llcx) }
    }

    fn type_i8(&self) -> &'ll Type {
        unsafe { llvm::LLVMInt8TypeInContext(self.llcx) }
    }

    fn type_i16(&self) -> &'ll Type {
        unsafe { llvm::LLVMInt16TypeInContext(self.llcx) }
    }

    fn type_i32(&self) -> &'ll Type {
        unsafe { llvm::LLVMInt32TypeInContext(self.llcx) }
    }

    fn type_i64(&self) -> &'ll Type {
        unsafe { llvm::LLVMInt64TypeInContext(self.llcx) }
    }

    fn type_i128(&self) -> &'ll Type {
        unsafe { llvm::LLVMIntTypeInContext(self.llcx, 128) }
    }

    fn type_isize(&self) -> &'ll Type {
        self.isize_ty
    }

    fn type_f32(&self) -> &'ll Type {
        unsafe { llvm::LLVMFloatTypeInContext(self.llcx) }
    }

    fn type_f64(&self) -> &'ll Type {
        unsafe { llvm::LLVMDoubleTypeInContext(self.llcx) }
    }

    fn type_func(&self, args: &[&'ll Type], ret: &'ll Type) -> &'ll Type {
        unsafe { llvm::LLVMFunctionType(ret, args.as_ptr(), args.len() as c_uint, False) }
    }

// ...

There is also logic for handling ABI types, such as generating aggregate (struct) types.

PTX generation

This is the final and most fun part of codegen: taking our LLVM bitcode and giving it to libNVVM. In theory it is as simple as just giving NVVM every single bitcode module, but in practice we do a couple of things before and after to reduce PTX size and speed things up.

The NVVM API

libNVVM is a dynamically linked library which is distributed with every download of the CUDA SDK. If you are on Windows, it should be somewhere around C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.3/nvvm/bin, where v11.3 is the version of CUDA you have installed; it is usually called nvvm64_40_0.dll. If you are on Linux, it should be somewhere around /opt/cuda/nvvm-prev/lib64/libnvvm.so. You can see its API either in the API docs or in its header file in the include folder.

We have our own high level bindings to it published as a crate called nvvm.

The libNVVM API could not be simpler; it is just a couple of functions:

  • Make new program
  • Add bitcode module
  • Lazy add bitcode module
  • Verify program
  • Compile program

The first step is always making a new program; a program is just a container for modules that then gets compiled.

Module loading

This is the most important part: we need to add our LLVM bitcode to the program. That should be a very simple thing that involves no calls to random functions in the rustc haystack, …right? Of course not, you didn’t seriously think we would make this straightforward, did you?

So, in theory it is very simple, just load the bitcode from the rlib and tell NVVM to load it. While this is easy and it works, it has its own very visible issues.

Traditionally, if you never use a function, either the compiler destroys it when using LTO, or the linker destroys it in its own dead code pass. The issue is that LTO is not always run, and we do not have a linker, NVVM is our linker. However, NVVM does not eliminate dead functions. I think you can guess why that is a problem, so unless we want 11mb PTX files (yes this is actually how big it was) we need to do something about it.

Module merging and DCE

To solve our dead code issue, we take a pretty simple approach. To start, we merge every module (one crate may be multiple modules because of codegen units) into a single module. Then, we do the following:

  • (Internalize) Iterate over every global and function then:
    • If the global/function is not a declaration (i.e. an extern decl) and not a kernel, then mark its linkage as internal and give it default visibility.
  • (Global DCE) Run the globalDCE LLVM Pass over the module. This will delete any globals/functions we do not use.

Internal linkage tells LLVM that the symbol is not needed externally, meaning it can delete the symbol if it is not used by any non-internal function. In our case, the only non-internal functions are kernels.
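As a rough sketch of what this step looks like (written against llvm-sys-style bindings to the LLVM C API purely for illustration; the backend uses its own LLVM bindings, and is_kernel here is a placeholder for the real check against the nvvm.annotations metadata):

use llvm_sys::core::*;
use llvm_sys::prelude::*;
use llvm_sys::transforms::ipo::LLVMAddGlobalDCEPass;
use llvm_sys::{LLVMLinkage, LLVMVisibility};

// Placeholder: the real implementation inspects the module's nvvm.annotations
// metadata to decide whether a function is a kernel.
unsafe fn is_kernel(_func: LLVMValueRef) -> bool {
    false
}

unsafe fn internalize_and_dce(module: LLVMModuleRef) {
    // Internalize every defined function that is not a kernel.
    let mut func = LLVMGetFirstFunction(module);
    while !func.is_null() {
        if LLVMIsDeclaration(func) == 0 && !is_kernel(func) {
            LLVMSetLinkage(func, LLVMLinkage::LLVMInternalLinkage);
            LLVMSetVisibility(func, LLVMVisibility::LLVMDefaultVisibility);
        }
        func = LLVMGetNextFunction(func);
    }

    // Do the same for defined globals.
    let mut global = LLVMGetFirstGlobal(module);
    while !global.is_null() {
        if LLVMIsDeclaration(global) == 0 {
            LLVMSetLinkage(global, LLVMLinkage::LLVMInternalLinkage);
            LLVMSetVisibility(global, LLVMVisibility::LLVMDefaultVisibility);
        }
        global = LLVMGetNextGlobal(global);
    }

    // With only kernels left externally visible, globalDCE removes everything
    // that is not reachable from them.
    let pm = LLVMCreatePassManager();
    LLVMAddGlobalDCEPass(pm);
    LLVMRunPassManager(pm, module);
    LLVMDisposePassManager(pm);
}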

In the future we could probably improve this further by combining it with our previous lazy-loading approach: only loading functions/modules into the module if they are actually used, driven by dependency graphs.

libdevice

There are a couple of special modules we need to load before we are done: libdevice and libintrinsics. The first and most important one is libdevice, a bitcode module of hyper-optimized math intrinsics that NVIDIA provides for us. You can find it as a .bc file in the libdevice folder inside your NVVM install location. Every function inside of it is prefixed with __nv_; you can find docs for it here.

We declare these intrinsics inside ctx_intrinsics.rs and link to them inside cuda_std. We also use them to codegen a lot of intrinsics inside intrinsic.rs, such as sqrtf32.

libdevice is also lazily loaded so we do not import useless intrinsics.
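For a feel of what these look like from Rust, here is a minimal illustrative sketch of declaring and calling one of the documented libdevice functions (cuda_std already wraps the common ones, so in real kernels you would just use its float extension methods):

// __nv_sqrtf is one of the functions documented in NVIDIA's libdevice docs;
// the codegen backend resolves the call against the lazily loaded libdevice
// bitcode module.
extern "C" {
    fn __nv_sqrtf(x: f32) -> f32;
}

// Illustrative helper only; cuda_std lowers regular float ops to these
// intrinsics for you.
pub fn gpu_sqrt(x: f32) -> f32 {
    unsafe { __nv_sqrtf(x) }
}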

libintrinsics

This is the last special module we load. It is simple: just a dumping ground for random wrapper functions that cuda_std or the codegen backend needs. You can find its LLVM IR definition in the codegen directory, in a file called libintrinsics.ll. All of its functions should be declared with the __nvvm_ prefix.

Compilation

Finally, we have everything loaded and we can compile our program. We do one last thing first, however.

NVVM has a function for verifying our program to make sure we did not add anything NVVM does not like. We run this before compilation just to be safe, although annoyingly it does not catch all errors; NVVM sometimes just segfaults instead, which is unfortunate.

Compiling is simple: we call NVVM's program compile function and panic if it fails. If it succeeds, we get a final PTX string, which we write to the file where rustc expects the final artifact.

Debugging the codegen backend

When you try to compile an entire language for a completely different type of hardware, stuff is bound to break. In this section we will cover how to debug segfaults, miscompilations, and more.

Segfaults

Segfaults are usually caused in one of two ways:

  • From LLVM, when calling it through FFI with bad inputs (bad values, bad types, etc).
  • From NVVM, when linking (generating PTX). This is the more common case.

The first case can be debugged in two ways:

  • Building the codegen backend in debug mode and using RUSTC_LOG="rustc_codegen_nvvm=trace" ($env:RUSTC_LOG = "rustc_codegen_nvvm=trace" if using PowerShell). Note that this will dump a LOT of output, and when I say a LOT, I am not joking, so please pipe it to a file. This gives you a detailed summary of almost every action the codegen backend has taken; examine the final few logs to see what the backend was doing right before the segfault. This is usually straightforward because the logs are detailed.

  • Building LLVM 7 with debug assertions. Coupled with logging, this should give all the info needed to debug a segfault; it gets LLVM to raise an assertion failure whenever something bad happens.

The latter case is a bit worse.

Segfaults in libNVVM generally happen because we gave libNVVM something it did not expect. In an ideal world libNVVM would just throw a validation error, but it wouldn't be an LLVM-based library if it threw friendly errors ;). libNVVM has been known to segfault on things like:

  • using integer types that aren't i1, i8, i16, i32, or i64 in function signatures (see int_replace.rs).
  • having debug info on multiple modules (this is technically disallowed per the spec, but it still shouldn't segfault).

Generally there is no good way to debug these failures other than hoping libNVVM throws a validation error (which will cause an ICE). I have created a tiny tool that runs llvm-extract over an LLVM IR file to try to isolate segfaulting functions; it works to some degree and I will add it to the project soon.

Miscompilations

Miscompilations are rare but annoying. They usually result in one of two things happening:

  • CUDA rejecting the PTX as a whole (throwing an InvalidPtx error). Run ptxas on the .ptx file to get a more informative error message. This is rare, but the most common cause is declaring invalid extern functions (grep for extern in the PTX file and check for odd functions that aren't CUDA syscalls like vprintf, malloc, free, etc).
  • The PTX containing invalid behavior. This is very specific and rare, but if you find it, the best way to debug it is:
    • Try to get a minimal working example so we don't have to search through megabytes of LLVM IR/PTX.
    • Use RUSTFLAGS="--emit=llvm-ir" and find crate_name.ll in target/nvptx64-nvidia-cuda/<debug/release>/deps/ and attach it in any bug report.
    • Attach the final PTX file.

That should give you an idea of who is responsible for the miscompilation: us, LLVM, or NVVM. That should let you isolate the cause, file a bug report to LLVM/NVIDIA, and generate different IR to avoid it.

If that doesn't work, it might be a bug inside CUDA itself, but that should be very rare. The best way to debug that (and really the only way) is to set the crate up for debug (and see if it still happens in debug). Then you can run your executable under NSight Compute, go to the source tab, and examine the SASS (an assembly level below PTX) to see if ptxas miscompiled it.

If you set up the codegen backend for debug, it should give you a mapping from Rust code to SASS, which should hopefully help you see what exactly is breaking.

Here is an example of the screen you should see: [screenshot of the NSight Compute source view]

Supported features

This page tracks Cargo/Rust and CUDA features that are currently supported or planned to be supported in the future, as well as some information about how they could be supported.

Note that Not Supported does not mean a feature will never be supported; it just means we haven't gotten around to adding it yet.

| Indicator | Meaning |
|-----------|---------|
|  | Not Applicable |
|  | Not Supported |
| ✔️ | Fully Supported |
| 🟨 | Partially Supported |

Rust features

| Feature Name | Support Level | Notes |
|--------------|---------------|-------|
| Opt-Levels | ✔️ | Behaves mostly the same (because LLVM is still used for optimizations), except that libNVVM opts are run on everything except no-opts, because NVVM only has -O0 and -O3 |
| codegen-units | ✔️ | |
| LTO | | We load bitcode modules lazily using dependency graphs, which then form a single module optimized by libNVVM, so all the benefits of LTO apply without pre-libNVVM LTO being needed |
| Closures | ✔️ | |
| Enums | ✔️ | |
| Loops | ✔️ | |
| If | ✔️ | |
| Match | ✔️ | |
| Proc Macros | ✔️ | |
| Try (?) | ✔️ | |
| 128 bit integers | 🟨 | Basic ops should work (and are emulated); advanced intrinsics like ctpop, rotate, etc. are unsupported |
| Unions | ✔️ | |
| Iterators | ✔️ | |
| Dynamic Dispatch | ✔️ | |
| Pointer Casts | ✔️ | |
| Unsized Slices | ✔️ | |
| Alloc | ✔️ | |
| Printing | ✔️ | |
| Panicking | ✔️ | Currently just traps (aborts) because of weird printing failures in the panic handler |
| Float Ops | ✔️ | Maps to libdevice intrinsics; calls to libm are not intercepted, though we may want to do that in the future |
| Atomics | | |

CUDA libraries

| Library Name | Support Level | Notes |
|--------------|---------------|-------|
| CUDA Runtime API | | The CUDA Runtime API is for CUDA C++; we use the Driver API |
| CUDA Driver API | 🟨 | Most functions are implemented, but there is still a lot left to wrap because it is gigantic |
| cuBLAS | | In progress |
| cuFFT | | |
| cuSOLVER | | |
| cuRAND | | cuRAND only works with the Runtime API; we have our own general-purpose GPU rand library called gpu_rand |
| cuDNN | | In progress |
| cuSPARSE | | |
| AmgX | | |
| cuTENSOR | | |
| OptiX | 🟨 | CPU OptiX is mostly complete; GPU OptiX is still heavily in progress because it needs support from the codegen backend |

GPU-side features

Note: Most of these categories are used very rarely in CUDA code, so do not be alarmed that many things appear unsupported. We focus on the features used by the wide majority of users.

| Feature Name | Support Level | Notes |
|--------------|---------------|-------|
| Function Execution Space Specifiers | | |
| Variable Memory Space Specifiers | ✔️ | Handled implicitly, but can be explicitly stated for statics with #[address_space(...)] |
| Built-in Vector Types | | Use linear algebra libraries like vek or glam |
| Built-in Variables | ✔️ | |
| Memory Fence Instructions | ✔️ | |
| Synchronization Functions | ✔️ | |
| Mathematical Functions | 🟨 | Less common functions like native f16 math are not supported |
| Texture Functions | | |
| Surface Functions | | |
| Read-Only Data Cache Load Function | | No real need; immutable references hint this automatically |
| Load Functions Using Cache Hints | | |
| Store Functions Using Cache Hints | | |
| Time Function | ✔️ | |
| Atomic Functions | | |
| Address Space Predicate Functions | ✔️ | Address spaces are implicitly handled, but these may be added for exotic interop with CUDA C/C++ |
| Address Space Conversion Functions | ✔️ | |
| Alloca Function | | |
| Compiler Optimization Hint Functions | | Existing core hints work |
| Warp Vote Functions | | |
| Warp Match Functions | | |
| Warp Reduce Functions | | |
| Warp Shuffle Functions | | |
| Nanosleep | ✔️ | |
| Warp Matrix Functions (Tensor Cores) | | |
| Asynchronous Barrier | | |
| Asynchronous Data Copies | | |
| Profiler Counter Function | ✔️ | |
| Assertion | ✔️ | |
| Trap Function | ✔️ | |
| Breakpoint | ✔️ | |
| Formatted Output | ✔️ | |
| Dynamic Global Memory Allocation | ✔️ | |
| Execution Configuration | ✔️ | |
| Launch Bounds | | |
| Pragma Unroll | | |
| SIMD Video Instructions | | |
| Cooperative Groups | | |
| Dynamic Parallelism | | |
| Stream Ordered Memory | ✔️ | |
| Graph Memory Nodes | | |
| Unified Memory | ✔️ | |
| __restrict__ | | Not needed; you get that performance boost automatically through Rust's noalias :) |

Frequently asked questions

This page will cover a lot of the questions people often have when they encounter this project, so they are addressed all at once.

Why not use rustc with the LLVM PTX backend?

Good question; there are quite a few reasons:

  • The LLVM PTX backend is still very much a work in progress; it is often missing things and/or breaks.
  • Due to odd dylib issues, the LLVM PTX backend does not work on Windows; it fails to link in intrinsics. This can be circumvented by building LLVM in a special way, but that is far beyond what most users will do to get a backend working.
  • NVVM is used in NVCC itself, therefore NVIDIA is much more receptive to bugs inside of it.
  • NVVM contains proprietary optimizations (which is why it's closed source) that are simply not present in the LLVM PTX backend and that yield considerable performance differences (especially on more complex kernels with more information in the IR).
  • For some reason (either rustc producing odd LLVM IR or the LLVM PTX backend being broken), the LLVM PTX backend often generates completely invalid PTX for trivial programs, so it is not an acceptable workflow for a production pipeline.
  • GPU and CPU codegen are fundamentally different; creating a codegen backend that is only for the GPU allows us to seamlessly implement features which would have been impossible or very difficult to implement in the existing codegen backend, such as:
    • Shared memory, which requires special generation of globals with custom address spaces; it is just not possible without explicit backend handling.
    • Custom linking logic to do dead code elimination, so we do not end up with large PTX files full of dead functions/globals.
    • Stripping away everything we do not need: no complex ABI handling, no shared library handling, control over how function calls are generated, etc.

So overall, the LLVM PTX backend is fit for smaller kernels/projects/proofs of concept. It is, however, not fit for compiling an entire language (core is very big) with dependencies and more. The end goal is for Rust to be usable over CUDA C/C++ with the same (or better!) performance and features; therefore, we must take advantage of every optimization NVCC has over us.

If NVVM IR is a subset of LLVM IR, can we not give rustc-generated LLVM IR to NVVM?

Short answer, no.

Long answer, there are a couple of things that make this impossible:

  • At the time of writing, libNVVM expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from a later LLVM version (which is what rustc uses) does not work.
  • NVVM IR is a subset of LLVM IR; there are tons of things that NVVM will not accept, such as many function attributes. This is well documented and you can find the spec here. Not to mention the many bugs in libNVVM I have found along the way, the most infuriating of which is NVVM not accepting integer types that aren't i1, i8, i16, i32, or i64. This required special handling in the codegen backend to convert these "irregular" types into vector types.

What is the point of using Rust if a lot of things in kernels are unsafe?

This is probably the most asked question by far, so let’s break it down in detail.

TL;DR: there are things we fundamentally cannot check, but that does not mean we cannot still prevent the many problems we can check.

Yes, it is true that GPU kernels have much more unsafe than CPU code usually does, but why is that?

The reason is that CUDA's entire model is not based on safety in any way; there are almost zero safety nets in CUDA. Rust is the polar opposite of this model: everything is safe unless there are invariants that cannot be checked by the compiler. Let's take a look at some of the invariants we face here.

Take this program as an example, written in CUDA C++:

#include <cstdlib>

__global__ void kernel(int* buf, int* other)
{
  int idx = threadIdx.x;
  buf[idx] = other[idx];
}

int main(void)
{
  int N = 50;
  int *a, *b, *d_a, *d_b;
  a = (int*)malloc(N*sizeof(int));
  b = (int*)malloc(N*sizeof(int));

  cudaMalloc(&d_a, N*sizeof(int));
  cudaMalloc(&d_b, N*sizeof(int));

  for (int i = 0; i < N; i++) {
    a[i] = 0;
    b[i] = 2;
  }

  cudaMemcpy(d_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b, N*sizeof(int), cudaMemcpyHostToDevice);

  kernel<<<1, N>>>(d_a, d_b);

  cudaMemcpy(a, d_a, N*sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(b, d_b, N*sizeof(int), cudaMemcpyDeviceToHost);

  /* do something with the data */

  cudaFree(d_a);
  cudaFree(d_b);
  free(a);
  free(b);
}

You may think this looks innocent enough; it's a very easy and understandable program. But if you really think about it, this is a minefield of things that could go wrong. Let's list most of them:

  • buf could be too small, that is undefined behavior (reading beyond allocated memory)
  • similarly, other could also be too small.
  • The kernel could have been called with too many or too few parameters.
  • The kernel could have been called with different grid/block dimensions than expected, causing out-of-bounds accesses.
  • Any of the cudaMalloc, cudaMemcpy, kernel launch, or cudaFree calls could have errored, and we simply ignore the errors.
  • We could have forgotten to initialize the buffers.
  • We could have forgotten to free the buffers.

This goes to show that CUDA C/C++, and CUDA overall, rely on shifting the burden of correctness from the API to the developer. Rust uses the completely opposite design model: the compiler verifies as much as it can, and the burden is only shifted to the developer, behind unsafe, when it is absolutely essential.

This creates a big problem for us: it is very difficult (and sometimes impossible) to prove correctness statically when wrapping how CUDA works. We can solve a lot of these points using things like RAII and high-level wrappers, but we fundamentally cannot prove many things. The most common place where this shows up is the CPU-GPU boundary, e.g. launching kernels.

Firstly, we cannot verify that the PTX we are calling is sound: that it has no data races, writes into the right buffers, does not rely on undocumented invariants, and does not write invalid data to buffers. This alone makes launching kernels permanently unsafe.

Second, CUDA does zero validation of kernel parameter mismatches; it will simply segfault, or even keep going and produce invalid data (or cause undefined behavior in the kernel). This is a design flaw in CUDA itself; we have no control over it and no 100% reliable way to fix it, so we must shift this burden of correctness to the developer.

Moreover, the CUDA GPU kernel model is entirely based on trust: trusting each thread to index into the correct place in buffers, trusting the caller of the kernel to uphold dimension invariants, etc. This is, once again, completely incompatible with how Rust does things. We can provide wrappers to calculate an index that always works, and macros to index a buffer automatically, but indexing in complex ways is a core operation in CUDA, and it is impossible for us to prove that whatever the developer is doing is correct.

Finally, we would love to be able to use mutable references as kernel parameters, but this would be unsound. Because each kernel function is effectively called many times in parallel with the same parameters, we would be aliasing the mutable reference, which rustc declares as unsound (aliasing rules). So raw pointers (or slightly-less-unsafe wrappers) need to be used. However, they are usually only needed for the initial buffer indexing, after which you can turn them into a mutable reference just fine (because you indexed in a way where no other thread will touch that element), as shown in the sketch below. Also note that shared references can be used as parameters just fine.
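Here is a minimal kernel sketch of that pattern, assuming cuda_std's #[kernel] attribute and thread::index_1d (names taken from cuda_std; the exact API may differ slightly between versions):

use cuda_std::prelude::*;

#[kernel]
pub unsafe fn add(a: &[f32], b: &[f32], out: *mut f32) {
    let idx = thread::index_1d() as usize;
    if idx < a.len() {
        // Every thread computed a distinct index, so turning the raw pointer
        // into a &mut here does not alias any other thread's element.
        let elem = &mut *out.add(idx);
        *elem = a[idx] + b[idx];
    }
}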

Now that we have outlined why this is the case, why is using Rust a benefit if we still need to use unsafe?

Well, it's simple: eliminating most of the things a developer needs to think about to have a safe program is still vastly safer than leaving everything to the developer to think about.

By using Rust, we eliminate:

  • The forgotten/unhandled CUDA errors problem (yay results!).
  • The uninitialized memory problem.
  • The forgetting to dealloc memory problem.
  • All of the inherent C++ problems in the kernel beyond the initial buffer indexing.
  • The mismatched grid/block dimension problem (by providing thread::index).
  • The forgetting to memcpy data back problem.

And countless other problems with things like graphs, streams, devices, etc.
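As an illustration of what this buys us on the CPU side, here is a rough sketch of the earlier CUDA C++ program written with cust, assuming the add kernel above has been compiled to a kernels.ptx file (the path is illustrative; Module::from_ptx, DeviceBuffer::from_slice, and the launch! macro are cust APIs, though exact signatures may vary between versions):

use cust::prelude::*;
use std::error::Error;

// PTX produced by compiling the GPU crate; the path is illustrative.
static PTX: &str = include_str!("../kernels.ptx");

fn main() -> Result<(), Box<dyn Error>> {
    // Initialize CUDA and create a context for the primary device (RAII).
    let _ctx = cust::quick_init()?;

    let module = Module::from_ptx(PTX, &[])?;
    let add = module.get_function("add")?;
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    // Allocation, initialization, and the host-to-device copy in one step;
    // the buffers are freed automatically when they drop.
    let mut a = DeviceBuffer::from_slice(&[1.0f32; 50])?;
    let mut b = DeviceBuffer::from_slice(&[2.0f32; 50])?;
    let mut out = DeviceBuffer::from_slice(&[0.0f32; 50])?;

    unsafe {
        // Slices are passed as a pointer and a length.
        launch!(add<<<1, 50, 0, stream>>>(
            a.as_device_ptr(),
            a.len(),
            b.as_device_ptr(),
            b.len(),
            out.as_device_ptr()
        ))?;
    }

    stream.synchronize()?;

    // Copy the results back to the host.
    let mut host_out = vec![0.0f32; 50];
    out.copy_to(&mut host_out[..])?;
    Ok(())
}

Every fallible CUDA call surfaces as a Result, and dropping the buffers, stream, and context cleans everything up, which is exactly the class of problems the list above is about.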

So, just because we cannot solve every problem with CUDA safety does not mean we cannot solve a lot of them and ease the burden of correctness on the developer.

Besides, using Rust only adds safety; it does not make CUDA more unsafe. This means there is only something to gain in terms of safety by using Rust.

Why not use Rust GPU with compute shaders?

The reasoning here is the same as for why you would use CUDA over OpenGL/Vulkan compute shaders:

  • CUDA usually outperforms shaders if kernels are written well and launch configurations are optimal.
  • CUDA has many useful features such as shared memory, unified memory, graphs, fine grained thread control, streams, the PTX ISA, etc.
  • Rust GPU does not perform many optimizations, and with rustc_codegen_ssa's less-than-ideal codegen, the optimizations performed by LLVM and libNVVM are needed.
  • SPIR-V is arguably still not suitable for serious GPU kernel codegen; it is underspecified, complex, and silent on many things that are needed, while libNVVM (which uses a well-documented subset of LLVM IR) and the PTX ISA are very thoroughly documented and specified.
  • Rust GPU is primarily focused on graphical shaders, with compute shaders being secondary. The Rust ecosystem needs that, but it also needs a project 100% focused on computing, and computing only.
  • SPIR-V cannot access many useful CUDA libraries such as OptiX, cuDNN, cuBLAS, etc.
  • SPIR-V debug info is still very young and Rust GPU cannot generate it, while rustc_codegen_nvvm can, which can be used for profiling kernels in something like NSight Compute.

Moreover, CUDA is the primary tool used in big computing industries such as VFX and scientific computing. It is therefore much easier for CUDA C++ users to pick up Rust for GPU computing when most of the concepts stay the same. Plus, we can interface with existing CUDA code by compiling it to PTX and then linking it with our Rust code using the CUDA linker API (which is exposed as a high-level wrapper in cust).

Why use the CUDA Driver API over the Runtime API?

Simply put, the Driver API provides better control over concurrency, contexts, and module management, and overall finer performance control than the Runtime API.

Let’s break it down into the main new concepts introduced in the Driver API.

Contexts

The first big difference in the Driver API is that CUDA context management is explicit rather than implicit.

Contexts are similar to CPU processes; they manage all of the resources, streams, allocations, etc. associated with the operations done inside them.

The Driver API gives you control over these contexts: you can create new contexts and drop them at any time, as opposed to the Runtime API, which works off an implicit context that is destroyed on device reset. This causes problems for larger applications, because a newly integrated piece of CUDA code could call device reset when it is finished, causing further uses of CUDA in the process to fail.
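A small sketch of what this looks like with cust (quick_init creates and pushes a context for the primary device; exact helper names may differ between cust versions):

use cust::prelude::*;

fn main() -> Result<(), cust::error::CudaError> {
    // Create and push a context for device 0; everything created while it is
    // current (streams, allocations, modules) belongs to it.
    let _ctx = cust::quick_init()?;

    let device = Device::get_device(0)?;
    println!("running on {}", device.name()?);

    Ok(())
    // `_ctx` drops here, releasing only this context rather than resetting the
    // device out from under every other CUDA user in the process.
}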

Modules

Modules are the second big difference in the Driver API. Modules are similar to shared libraries; they contain all of the globals and functions (kernels) inside a PTX/cubin file. The Driver API is language-agnostic; it works purely off PTX/cubin files. To explain why this matters, we need to briefly cover what cubin and PTX files are.

PTX is a low-level, assembly-like language and the penultimate step before what the GPU actually executes. It is human-readable, and you can dump it from a CUDA C++ program with nvcc ./file.cu --ptx. This PTX is then optimized and lowered into a final format called SASS and packaged into a cubin (CUDA binary) file.

Driver API modules can be loaded as PTX, cubin, or fatbin files. If they are loaded as PTX, the Driver API will JIT-compile the PTX to cubin and cache it. You can also compile PTX to cubin yourself using ptx-compiler and cache it.

This pipeline provides much better control over which functions you actually need to load and cache. You can separate different functions into different modules that you load dynamically (and even reload dynamically). This can yield considerable performance benefits when dealing with a lot of functions.

Streams

Streams are (one of) CUDA's ways of dispatching multiple kernels in parallel. You can think of them roughly as OS threads. Kernels dispatched one after the other inside a particular stream will execute one after the other on the GPU, which is helpful for kernels that rely on a previous kernel's result.

The CUDA Runtime API operates off a single global stream. This causes a lot of issues for large programs or libraries that need to manage many kernels being dispatched at the same time as efficiently as possible.
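Below is a rough sketch of why this matters, using cust's Stream and launch! APIs (the kernel name scale and its pointer-plus-length signature are hypothetical): two launches enqueued on different streams are free to overlap, whereas on a single stream they would serialize.

use cust::prelude::*;

fn launch_in_parallel(
    module: &Module,
    a: &mut DeviceBuffer<f32>,
    b: &mut DeviceBuffer<f32>,
) -> Result<(), cust::error::CudaError> {
    // Hypothetical kernel that scales a buffer in place.
    let scale = module.get_function("scale")?;

    let stream_a = Stream::new(StreamFlags::NON_BLOCKING, None)?;
    let stream_b = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    unsafe {
        // Enqueued on different streams, so the GPU may run them concurrently.
        launch!(scale<<<1, 256, 0, stream_a>>>(a.as_device_ptr(), a.len()))?;
        launch!(scale<<<1, 256, 0, stream_b>>>(b.as_device_ptr(), b.len()))?;
    }

    stream_a.synchronize()?;
    stream_b.synchronize()?;
    Ok(())
}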

Why target NVIDIA GPUs only instead of using something that can work on AMD?

This is a complex issue with many arguments for both sides, so I will give you both sides as well as my opinion.

Pros for using OpenCL over CUDA:

  • OpenCL (mostly) works on everything because it is a specification, not an actual centralized tool.
  • OpenCL will be decently fast on most systems.

Cons for using OpenCL over CUDA:

  • Just like all open specifications, not every implementation is as good as, or supports the same things as, the others. Just because the absolute basics work does not mean more exotic features work everywhere, because some vendors may lag behind others.
  • OpenCL is slow to add new features; this is a natural consequence of being an open specification that many vendors need to implement. For example, OpenCL 3.0 (announced around April 2020) is supported by basically nobody; NVIDIA cards support OpenCL 2.0 while AMD cards support OpenCL 2.1. This means new features cannot be relied upon, because they are unlikely to work on a lot of cards for a LONG time.
  • OpenCL kernels can only be written in OpenCL C (based on C99); OpenCL C++ exists, but again, not everything supports it. This makes complex programs more difficult to create.
  • OpenCL has fewer tools and libraries.
  • OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off an assembly format (PTX) and debug info, essentially how CPU code works. This makes writing language-agnostic things for OpenCL near impossible and locks you into using OpenCL C.
  • OpenCL is plagued with serious driver bugs which have not been fixed, or which occur only on certain vendors.

Pros for using CUDA over OpenCL:

  • CUDA is for the most part the industry-standard tool for “higher level” computing such as scientific or VFX computing.
  • CUDA is a proprietary tool, meaning NVIDIA can ship bug fixes and features much faster than releasing a new spec and waiting for vendors to implement it. This allows more features to be added, such as cooperative kernels, CUDA graphs, unified memory, new profilers, etc.
  • CUDA is a single implementation, meaning that if something does or does not work on one system, it is unlikely to be different on another system (assuming you are not using different architectures, where one GPU may lack a feature).
  • CUDA is usually 10-30% faster than OpenCL overall; this is likely due in part to subpar OpenCL drivers from NVIDIA, but it is unlikely this performance gap will change in the near future.
  • CUDA has a much richer set of libraries and tools than OpenCL, such as cuFFT, cuBLAS, cuRand, cuDNN, OptiX, NSight Compute, cuFile, etc.
  • You can seamlessly use existing CUDA C/C++ code with cust or rustc_codegen_nvvm-generated PTX by using the CUDA linker APIs which are exposed in cust. Allowing for incremental switching to Rust.
  • There is a generally larger set of code samples in CUDA C/C++ over OpenCL.
  • Documentation is far better, there are (mostly) complete API docs for every single CUDA library and function out there.
  • CUDA generally offers more control over the internals of how CUDA executes your GPU code. For example, you can choose to keep PTX which uses a virtual architecture, or you can compile that to cubin (SASS) and cache that for faster load times.

Cons for using CUDA over OpenCL:

  • CUDA only works on NVIDIA GPUs.

What makes cust and RustaCUDA different?

Cust is a fork of RustaCUDA which changes a lot of things inside of it, as well as adds new features that are not inside of RustaCUDA.

The most significant changes (this list is not complete!) are:

  • Drop code no longer panics on failure to drop raw CUDA handles. This is so that InvalidAddress errors, which cause CUDA to tear down the driver context and invalidate all memory allocations, no longer cause piles of panics from device boxes being dropped when returning from a function with ?.
  • cuda-sys is no longer used; instead, we have our own bindings, cust_raw, so we can keep up with the latest CUDA features.
  • CUDA occupancy functions have been added.
  • PTX linking functions have been added.
  • Native support for vek linear algebra types for grid/block dimensions and DeviceCopy has been added under the vek feature.
  • Util traits have been added.
  • Basic graph support has been added.
  • Some functions have been renamed.
  • Some functions have been added.

Changes that are currently in progress but not done/experimental:

  • Surfaces
  • Textures
  • Graphs
  • PTX validation

Just like RustaCUDA, cust makes no assumptions about what language was used to generate the PTX/cubin. It could be C, C++, Futhark, or, best of all, Rust!

Cust’s name is literally just Rust + CUDA mashed together in a horrible way. Or you can pretend it stands for custard if you really like custard.