Rust CUDA May 2025 project update
Rust CUDA enables you to write and run CUDA kernels in Rust, compiling them through NVVM IR to execute directly on NVIDIA GPUs.
Work on the project is ongoing, and we wanted to share an update.
To follow along or get involved, check out the rust-cuda repo on GitHub.
New Docker images
Thanks to @adamcavendish, we now automatically build and publish Docker images as part of CI. These images are based on NVIDIA's official CUDA containers and come preconfigured to build and run Rust GPU kernels.
Rust CUDA uses NVVM under the hood, NVIDIA's LLVM-based CUDA frontend. NVVM is currently based on LLVM 7, and setting it up manually can be tedious and error-prone. These images eliminate that setup entirely.
Improved constant memory handling
Background
CUDA exposes distinct memory spaces, each with different characteristics:
| Memory Space | Scope | Speed | Size | Use Case |
|---|---|---|---|---|
| Registers | Per thread | Fastest | Very small | Thread-local temporaries |
| Shared memory | Per block | Fast | ~48 KB per block | Inter-thread communication within a block |
| Constant memory | Device-wide | Fast reads (cached) | 64 KB total | Read-only values broadcast to all threads |
| Global memory | Device-wide | Slower | GBs | General-purpose read/write memory |
CUDA C++ code is often monolithic, with minimal abstraction and everything in one file. Rust CUDA brings idiomatic Rust to GPU programming and encourages modularity, traits, generics, and reuse of third-party `no_std` crates from crates.io. As a result, CUDA programs written in Rust tend to be more complex and to depend on more static data spread across your code and its dependencies.
A good example is `curve25519-dalek`, a cryptographic crate that defines large static lookup tables for scalar multiplication and point decompression. These values are immutable and read-only (ideal for constant memory), but together they exceed the 64 KB limit. Using `curve25519-dalek` as a dependency means your kernel's static data will never entirely fit in constant memory.
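For a sense of scale, here is a small runnable sketch with hypothetical table shapes (the constants are illustrative, not `curve25519-dalek`'s actual layout) showing how just a few moderately sized lookup tables blow past the 64 KB budget:

```rust
// Hypothetical table shapes, chosen to illustrate scale only; these are NOT
// curve25519-dalek's real data layouts.
const POINT_BYTES: usize = 3 * 5 * 8; // 3 field elements per point, 5 u64 limbs each
const TABLE_POINTS: usize = 32 * 8; // windowed table: 32 windows x 8 points each

fn main() {
    const CONSTANT_MEMORY_LIMIT: usize = 64 * 1024; // CUDA's 64 KB budget

    let one_table = TABLE_POINTS * POINT_BYTES;
    let three_tables = 3 * one_table;

    // One table fits comfortably on its own...
    println!("one table: {one_table} bytes"); // 30720 bytes
    assert!(one_table <= CONSTANT_MEMORY_LIMIT);

    // ...but a handful together no longer do.
    println!("three tables: {three_tables} bytes"); // 92160 bytes
    assert!(three_tables > CONSTANT_MEMORY_LIMIT);
}
```

Each static fits individually, so no single annotation looks wrong; it is only the total across the crate graph that overflows.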
The issue
Previously, Rust CUDA would try to place all eligible static values into constant memory automatically. If you had too many, or one was too big, your kernel would break at runtime: CUDA would return an `IllegalAddress` error with no clear cause.
Manual placement via `#[cuda_std::address_space(constant)]` or `#[cuda_std::address_space(global)]` was possible, but only for code you controlled; the annotations did not help for dependencies pulled from crates.io. This made it dangerous to use larger crates or to write more modular GPU programs, because at any point they might tip over the 64 KB limit and start throwing runtime errors.
This situation had the potential to create frustrating and difficult-to-diagnose bugs. For example:
- Adding a new `no_std` crate to a project could inadvertently push total static data size over the constant memory limit, causing crashes. This could happen even if the new crate's functionality was never directly invoked, simply due to the inclusion of its static data.
- A kernel might function correctly in one build configuration but fail in another if different features or Cargo flags changed which static variables were included in the final binary.
- If a large static variable was initially unused, the compiler might optimize it away. If subsequent code changes caused that static to be referenced, it would be included, potentially tipping over the memory limit and causing runtime failures.
- Code behavior could vary unexpectedly across different versions of a dependency or between debug and release builds.
The fix
New contributor @brandonros and Rust CUDA maintainer @LegNeato landed a change that avoids those pitfalls with a conservative default and a safe opt-in mechanism:
- By default, all statics are placed in global memory.
- A new opt-in flag, `--use-constant-memory-space`, enables automatic placement in constant memory.
- If a static is too large, it is spilled to global memory automatically, even when the flag is enabled.
- Manual overrides with `#[cuda_std::address_space(constant)]` or `#[cuda_std::address_space(global)]` still work and take precedence.
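The precedence between these rules can be modeled with a short CPU-side sketch. This is an illustration of the decision logic with hypothetical names, not the actual codegen code:

```rust
// CPU-side model of the new placement rule; names are illustrative.
#[derive(Debug, PartialEq, Clone, Copy)]
enum AddressSpace {
    Global,
    Constant,
}

/// Decide where a static lands. Manual annotations win; otherwise everything
/// defaults to global unless the opt-in flag is set, and even then
/// oversized statics spill back to global memory.
fn place_static(
    manual: Option<AddressSpace>,
    use_constant_flag: bool,
    size: usize,
) -> AddressSpace {
    const CONSTANT_LIMIT: usize = 64 * 1024;
    if let Some(space) = manual {
        return space; // explicit #[cuda_std::address_space(...)] override
    }
    if use_constant_flag && size <= CONSTANT_LIMIT {
        AddressSpace::Constant
    } else {
        AddressSpace::Global
    }
}

fn main() {
    // Default: global, even for small statics.
    assert_eq!(place_static(None, false, 1024), AddressSpace::Global);
    // With --use-constant-memory-space: small statics go to constant memory...
    assert_eq!(place_static(None, true, 1024), AddressSpace::Constant);
    // ...but oversized ones spill to global automatically.
    assert_eq!(place_static(None, true, 128 * 1024), AddressSpace::Global);
    // Manual annotation always takes precedence.
    assert_eq!(
        place_static(Some(AddressSpace::Constant), false, 1024),
        AddressSpace::Constant
    );
    println!("all placement rules hold");
}
```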
This gives developers some level of control without the risk of unstable runtime behavior.
Future work
This change prevents runtime errors and hard-to-debug issues but may reduce performance in some cases by not fully utilizing constant memory.
The long-term goal is to make automatic constant memory placement smarter so we can turn it on by default without breaking user code. To get there, we need infrastructure to support correct and tunable placement logic.
Planned improvements include:
- Tracking total constant memory usage across all static variables during codegen.
- Spilling based on cumulative usage, not just individual static size.
- Failing at compile time when the limit is exceeded, especially for manually annotated statics.
- Compiler warnings when usage is close to the 64 KB limit, perhaps with a configurable range.
- User-defined packing policies, such as prioritizing constant placement of small or large statics, or statics from a particular crate.
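As a sketch of the cumulative-tracking idea above, here is a hypothetical CPU-side model that packs statics greedily against the 64 KB budget and spills the remainder. It is illustrative only, not the planned compiler implementation:

```rust
// Sketch of cumulative constant-memory budgeting: greedily pack statics into
// the 64 KB budget and spill the rest to global memory. Hypothetical model.
const CONSTANT_LIMIT: usize = 64 * 1024;

/// Given each static's size in bytes, returns (placed_in_constant, spilled_to_global).
fn pack_statics(sizes: &[usize]) -> (Vec<usize>, Vec<usize>) {
    let mut used = 0usize;
    let mut constant = Vec::new();
    let mut global = Vec::new();
    for &size in sizes {
        // Spill based on *cumulative* usage, not just individual static size.
        if used + size <= CONSTANT_LIMIT {
            used += size;
            constant.push(size);
        } else {
            global.push(size);
        }
    }
    (constant, global)
}

fn main() {
    // Three 30 KB statics: each fits individually, but only two fit together.
    let sizes = [30 * 1024, 30 * 1024, 30 * 1024];
    let (constant, global) = pack_statics(&sizes);
    println!("constant: {constant:?}, spilled: {global:?}");
    assert_eq!(constant.len(), 2);
    assert_eq!(global.len(), 1);
}
```

A user-defined packing policy would replace the greedy first-come order here with, say, smallest-first or hottest-first ordering before packing.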
These should give developers fine-grained control and make it possible to use profiling data or usage frequency to drive placement decisions for maximum performance.
If these improvements sound interesting to you, join us in issue #218. We're always looking for new contributors!
Updated examples and CI
@giantcow, @jorge-ortega, @adamcavendish, and @LegNeato fixed broken examples, cleaned up CI, and added a new GEMM example. These steady improvements are important to keep the project healthy and usable.
Cleaned up bindings
CUDA libraries ship as binary objects, typically wrapped in Rust using `-sys` crates. With many subframeworks like cuDNN, cuBLAS, and OptiX, maintaining these crates requires generating bindings automatically via `bindgen`. @adamcavendish and @jorge-ortega streamlined our `bindgen` setup to simplify maintenance and make subframeworks easier to include or exclude.
Call for contributors
We need your help to shape the future of CUDA programming in Rust. Whether you're a maintainer, contributor, or user, there's an opportunity to get involved. We're especially interested in adding maintainers to make the project sustainable.
Be aware that the process may be a bit bumpy as we are still getting the project in order.
If you'd prefer to focus on non-proprietary and multi-vendor platforms, check out our related Rust GPU project. It is similar to Rust CUDA but targets SPIR-V for Vulkan GPUs.