Compute Capability Gating

This section covers how to write code that adapts to different CUDA compute capabilities using conditional compilation.

What are Compute Capabilities?

CUDA GPUs have different "compute capabilities" that determine which features they support. Each capability is identified by a version number such as 3.5, 5.0, 6.1, or 7.5. Higher numbers generally mean more features are available.

For example:

Compute capability 5.0+ supports 64-bit integer min/max and bitwise atomic operations
Compute capability 6.0+ supports double-precision (f64) atomic operations
Compute capability 7.0+ supports tensor core operations
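
The three examples above can be modeled as a small lookup table. This is only a sketch for reasoning about version comparisons: the `Feature` enum and both functions are hypothetical names made up for illustration, not part of rust-cuda or the CUDA toolkit.

```rust
/// A (major, minor) compute capability, e.g. (6, 1) for 6.1.
type Capability = (u32, u32);

/// Features named in the list above (illustrative enum, not a rust-cuda type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Feature {
    Int64MinMaxBitwiseAtomics,
    F64Atomics,
    TensorCores,
}

/// Minimum capability for each feature, copied from the list above.
fn minimum_capability(feature: Feature) -> Capability {
    match feature {
        Feature::Int64MinMaxBitwiseAtomics => (5, 0),
        Feature::F64Atomics => (6, 0),
        Feature::TensorCores => (7, 0),
    }
}

/// Tuples compare lexicographically, so (6, 1) >= (6, 0) and (5, 2) < (6, 0).
fn supports(gpu: Capability, feature: Feature) -> bool {
    gpu >= minimum_capability(feature)
}
```

Comparing (major, minor) tuples rather than floating-point numbers avoids surprises once capabilities such as 10.0 and 10.1 are involved.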

For comprehensive details, see NVIDIA's CUDA documentation on GPU architectures.

Virtual vs Real Architectures

In CUDA terminology:

Virtual architectures (compute_XX) define the PTX instruction set and available features
Real architectures (sm_XX) represent actual GPU hardware

rust-cuda works exclusively with virtual architectures since it only generates PTX. The NvvmArch::ComputeXX enum values correspond to CUDA's virtual architectures.

Using Target Features

When building your kernel, the NvvmArch::ComputeXX variant you choose enables specific target_feature flags. These can be used with #[cfg(...)] to conditionally compile code based on the capabilities of the target GPU.

For example, this checks whether the target architecture supports running compute 6.0 code or newer:

#[cfg(target_feature = "compute_60")]

Think of it as asking: “Is the GPU I’m building for at least compute 6.0?” Depending on which NvvmArch::ComputeXX is used to build the kernel, there is a different answer:

Building for Compute60 → ✓ Yes (exact match)
Building for Compute70 → ✓ Yes (7.0 GPUs support 6.0 code)
Building for Compute50 → ✗ No (5.0 GPUs can't run 6.0 code)

These features let you write optimized code paths for specific GPU generations while still supporting older ones.
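
The yes/no answers above follow one rule for base architectures: every capability at or below the build target is enabled. A minimal sketch of that rule, assuming an illustrative list of capabilities (`KNOWN_BASE` and `base_target_features` are hypothetical names; the set that the codegen backend actually knows about may differ):

```rust
/// Illustrative list of base capabilities, written the way the feature names
/// are: 6.0 -> 60, 7.5 -> 75. This list is an assumption for the sketch.
const KNOWN_BASE: &[u32] = &[50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86, 87, 89, 90];

/// Models which `compute_XX` target features are active when building for a
/// base (no-suffix) architecture: everything at or below the target.
fn base_target_features(target: u32) -> Vec<String> {
    KNOWN_BASE
        .iter()
        .filter(|&&cap| cap <= target)
        .map(|cap| format!("compute_{}", cap))
        .collect()
}
```

With this model, `base_target_features(70)` includes `compute_60`, while `base_target_features(50)` does not, which matches the three cases listed above.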

Specifying Compute Capabilities

Starting with CUDA 12.9, NVIDIA introduced architecture suffixes that affect compatibility.

Base Architecture (No Suffix)

Example: NvvmArch::Compute70

This is the behavior described above, and it was the only option in CUDA 12.8 and earlier.

When to use: Default choice for maximum compatibility.

Example usage:

// In build.rs
CudaBuilder::new("kernels")
    .arch(NvvmArch::Compute70)
    .build()
    .unwrap();

// In your kernel code:
#[cfg(target_feature = "compute_60")]  // ✓ Pass (older compute capability)
#[cfg(target_feature = "compute_70")]  // ✓ Pass (current compute capability)
#[cfg(target_feature = "compute_80")]  // ✗ Fail (newer compute capability)

Family Suffix ('f')

Example: NvvmArch::Compute101f

Specifies code compatible with the same major compute capability version and with an equal or higher minor compute capability version.

When to use: When you need features from a specific minor version but want forward compatibility within the family.

Example usage:

// In build.rs
CudaBuilder::new("kernels")
    .arch(NvvmArch::Compute101f)
    .build()
    .unwrap();

// In your kernel code:
#[cfg(target_feature = "compute_100")]   // ✗ Fail (10.0 < 10.1)
#[cfg(target_feature = "compute_101")]   // ✓ Pass (equal major, equal minor)
#[cfg(target_feature = "compute_103")]   // ✓ Pass (equal major, greater minor)
#[cfg(target_feature = "compute_101f")]  // ✓ Pass (the 'f' variant itself)
#[cfg(target_feature = "compute_100f")]  // ✗ Fail (other 'f' variant)
#[cfg(target_feature = "compute_90")]    // ✗ Fail (different major)
#[cfg(target_feature = "compute_110")]   // ✗ Fail (different major)
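
The pass/fail table above can be reproduced with a short model. This is a sketch of the rule as this page describes it, not the backend's actual code; `family_target_features` is a hypothetical helper, and the list of known capabilities is supplied by the caller because the real set is not stated here.

```rust
/// Models the target features enabled by a family ('f') build, following the
/// table above: base features with the same major version and an equal or
/// higher minor version, plus the 'f' variant itself.
/// Capabilities use the feature-name encoding: 10.1 -> 101.
fn family_target_features(target: u32, known_base: &[u32]) -> Vec<String> {
    let major = target / 10;
    let mut features: Vec<String> = known_base
        .iter()
        .filter(|&&cap| cap / 10 == major && cap >= target)
        .map(|cap| format!("compute_{}", cap))
        .collect();
    // Only the family variant being built for is enabled, not other 'f' variants.
    features.push(format!("compute_{}f", target));
    features
}
```

Calling it with a target of 101 and the capabilities 90, 100, 101, 103, and 110 yields `compute_101`, `compute_103`, and `compute_101f`, the same three entries marked as passing above.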

Architecture Suffix ('a')

Example: NvvmArch::Compute100a

Specifies code that only runs on GPUs of that specific compute capability and no others. However, during compilation, it enables all available instructions for the architecture, including all base variants up to the same version and all family variants with the same major version and equal or lower minor version.

When to use: When you need to use architecture-specific features (like certain Tensor Core operations) that are only available on that exact GPU model.

Example usage:

// In build.rs
CudaBuilder::new("kernels")
    .arch(NvvmArch::Compute100a)
    .build()
    .unwrap();

// In your kernel code:
#[cfg(target_feature = "compute_100a")]  // ✓ Pass (the 'a' variant itself)
#[cfg(target_feature = "compute_100")]   // ✓ Pass (base variant)
#[cfg(target_feature = "compute_90")]    // ✓ Pass (lower base variant)
#[cfg(target_feature = "compute_100f")]  // ✓ Pass (family variant with same major/minor)
#[cfg(target_feature = "compute_101f")]  // ✗ Fail (family variant with higher minor)
#[cfg(target_feature = "compute_110")]   // ✗ Fail (higher major version)
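
As a rough model of the table above, the enabled features for an 'a' build are the union of three sets. The function name and the caller-supplied capability lists are assumptions made for this sketch; it mirrors the rule as described on this page rather than the backend's implementation.

```rust
/// Models the target features enabled by an architecture-specific ('a') build:
/// all base features up to the target, family features with the same major
/// version and an equal or lower minor version, and the 'a' variant itself.
/// Capabilities use the feature-name encoding: 10.0 -> 100.
fn arch_specific_target_features(
    target: u32,
    known_base: &[u32],
    known_family: &[u32],
) -> Vec<String> {
    let major = target / 10;
    let mut features: Vec<String> = known_base
        .iter()
        .filter(|&&cap| cap <= target)
        .map(|cap| format!("compute_{}", cap))
        .collect();
    features.extend(
        known_family
            .iter()
            .filter(|&&cap| cap / 10 == major && cap <= target)
            .map(|cap| format!("compute_{}f", cap)),
    );
    features.push(format!("compute_{}a", target));
    features
}
```

For a target of 100 with base capabilities 90, 100, 101, 110 and family capabilities 100, 101, the result is `compute_90`, `compute_100`, `compute_100f`, and `compute_100a`, matching the entries marked as passing above.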

Note: While the 'a' variant enables all these features during compilation (allowing you to use all available instructions), the generated PTX code will still only run on the exact GPU architecture specified.

For more details on suffixes, see NVIDIA's blog post on family-specific architecture features.

Manual Compilation (Without CudaBuilder)

If you're invoking rustc directly instead of using cuda_builder, you only need to specify the architecture through LLVM args:

rustc --target nvptx64-nvidia-cuda \
    -C llvm-args=-arch=compute_61 \
    -Z codegen-backend=/path/to/librustc_codegen_nvvm.so \
    ...

Or with cargo:

export RUSTFLAGS="-C llvm-args=-arch=compute_61 -Z codegen-backend=/path/to/librustc_codegen_nvvm.so"
cargo build --target nvptx64-nvidia-cuda

The codegen backend automatically synthesizes target features based on the architecture type as described above.
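
If the flags are assembled programmatically, for example from a wrapper script, the string construction might look like the following sketch. Both helper names are hypothetical, and the backend path is passed in as-is; only the flag layout is taken from the commands above.

```rust
/// Builds the `-arch=compute_XX` LLVM argument for a capability,
/// e.g. (6, 1) -> "-arch=compute_61" and (10, 0) -> "-arch=compute_100".
fn llvm_arch_arg(major: u32, minor: u32) -> String {
    format!("-arch=compute_{}{}", major, minor)
}

/// Joins the pieces into a RUSTFLAGS value shaped like the one shown above.
/// Note: a backend path containing spaces would need extra quoting.
fn rustflags(major: u32, minor: u32, backend_path: &str) -> String {
    format!(
        "-C llvm-args={} -Z codegen-backend={}",
        llvm_arch_arg(major, minor),
        backend_path
    )
}
```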

Common Patterns for Base Architectures

These patterns work when using base architectures (no suffix), which enable all lower capabilities:

At Least a Capability (Default)

// Code that requires compute 6.0 or higher
#[cfg(target_feature = "compute_60")]
{
    cuda_std::atomic::atomic_add(data, 1.0); // f64 atomics need 6.0+
}

Exactly One Capability

// Code that targets exactly compute 6.1 (not 6.2+)
#[cfg(all(target_feature = "compute_61", not(target_feature = "compute_62")))]
{
    // Features specific to compute 6.1
}

Up To a Maximum Capability

// Code that works up to compute 6.0 (not 6.1+)
#[cfg(all(target_feature = "compute_35", not(target_feature = "compute_61")))]
{
    // Maximum compatibility implementation
}

Targeting Specific Architecture Ranges

// This block compiles when building for architectures >= 6.0 but < 8.0
#[cfg(all(target_feature = "compute_60", not(target_feature = "compute_80")))]
{
    // Code here can use features from 6.0+ but must not use 8.0+ features
}
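
The patterns above can be combined into mutually exclusive tiers so that exactly one definition of a function exists for any base architecture. The sketch below uses a hypothetical function name and placeholder return values; a real kernel would put its optimized and fallback implementations in the function bodies.

```rust
// Newest tier: compiled only when building for compute 8.0 or higher.
#[cfg(target_feature = "compute_80")]
fn selected_tier() -> &'static str {
    "compute_80 or newer"
}

// Middle tier: at least 6.0 but below 8.0.
#[cfg(all(target_feature = "compute_60", not(target_feature = "compute_80")))]
fn selected_tier() -> &'static str {
    "compute_60 up to, but not including, compute_80"
}

// Fallback tier: anything below 6.0, and any target where none of these
// features are set (for example, a normal host build).
#[cfg(not(target_feature = "compute_60"))]
fn selected_tier() -> &'static str {
    "baseline"
}
```

Because base architectures enable all lower capabilities, `compute_80` implies `compute_60`, so the three conditions never overlap and always cover every case. This reasoning relies on the base-architecture behavior; with 'f' or 'a' suffixes the set of enabled features differs, so the tiers should be reviewed separately.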

Debugging Capability Issues

If you encounter errors about missing functions or features:

Check the compute capability you're targeting in cuda_builder
Verify your GPU supports the features you're using
Use nvidia-smi to check your GPU's compute capability
Add appropriate #[cfg] guards or increase the target architecture
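
For the `nvidia-smi` step, reasonably recent drivers are believed to accept `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`, which prints a value such as `8.6`; treat the exact query field as an assumption and check `nvidia-smi --help-query-gpu` on your system. A minimal sketch for parsing one line of that output (the function name is hypothetical):

```rust
/// Parses a single compute capability such as "8.6" into (major, minor).
/// Returns None for anything that is not two dot-separated integers.
fn parse_compute_cap(text: &str) -> Option<(u32, u32)> {
    let (major, minor) = text.trim().split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}
```

The resulting tuple can be compared directly with the capability you are building for, since tuples order by major version first and minor version second.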

Runtime Behavior

Again, rust-cuda only generates PTX, not pre-compiled GPU binaries ("fatbinaries"). This PTX is then JIT-compiled by the CUDA driver at runtime.

For more details, see NVIDIA's documentation on GPU compilation and JIT compilation.
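
Since the driver decides at load time whether it can JIT-compile the PTX, it can help to reason about which GPUs a given build will run on. The sketch below encodes the rules as this page describes them for each suffix; the `Suffix` enum and function are hypothetical, and NVIDIA's actual family membership may be more nuanced than a simple major/minor comparison.

```rust
/// Which kind of architecture the PTX was built for (illustrative enum).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Suffix {
    Base,
    Family,
    ArchSpecific,
}

/// Models whether PTX built for `ptx` (major, minor) is expected to load on a
/// GPU with capability `gpu`, per the descriptions on this page.
fn ptx_can_run_on(ptx: (u32, u32), suffix: Suffix, gpu: (u32, u32)) -> bool {
    match suffix {
        // Base PTX is forward compatible with any equal or newer GPU.
        Suffix::Base => gpu >= ptx,
        // Family PTX: same major version, equal or higher minor version.
        Suffix::Family => gpu.0 == ptx.0 && gpu.1 >= ptx.1,
        // Architecture-specific PTX: that exact capability only.
        Suffix::ArchSpecific => gpu == ptx,
    }
}
```

For example, a `Compute70` build is expected to run on an 8.6 GPU, while a `Compute100a` build is expected to run only on 10.0 hardware.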
