A production-ready CUDA parallel algorithms library with a four-layer architecture (Layers 0-3), supporting education, extensibility, and production use cases.
```
┌────────────────────────────────────────────────────────────┐
│ Layer 3: High-Level API (STL-style)                        │
│ cuda::reduce(), cuda::sort()                               │
└────────────────────────────────────────────────────────────┘
                              ▲
┌────────────────────────────────────────────────────────────┐
│ Layer 2: Algorithm Wrappers                                │
│ cuda::algo::reduce_sum(), memory management                │
└────────────────────────────────────────────────────────────┘
                              ▲
┌────────────────────────────────────────────────────────────┐
│ Layer 1: Device Kernels                                    │
│ Pure __global__ kernels, no memory allocation              │
└────────────────────────────────────────────────────────────┘
                              ▲
┌────────────────────────────────────────────────────────────┐
│ Layer 0: Memory Foundation                                 │
│ Buffer<T>, unique_ptr<T>, MemoryPool, Allocator concepts   │
└────────────────────────────────────────────────────────────┘
```
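The separation the diagram describes (pure compute below, memory-owning wrappers above) can be sketched in plain host C++. This is an illustrative stand-in, not the library's API: in the real library, Layer 0 types manage device memory and Layer 1 consists of `__global__` kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Layer 1 stand-in: a pure "kernel" that computes over raw memory and
// never allocates (the real version is a __global__ kernel).
long long reduce_sum_kernel(const int* data, std::size_t n) {
    long long acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc += data[i];
    return acc;
}

// Layer 0 stand-in: RAII ownership of a typed buffer (host memory here;
// the real Buffer<T> wraps device allocation and release).
template <typename T>
class OwningBuffer {
public:
    explicit OwningBuffer(std::size_t n) : data_(new T[n]), size_(n) {}
    T* data() { return data_.get(); }
    std::size_t size() const { return size_; }
private:
    std::unique_ptr<T[]> data_;
    std::size_t size_;
};

// Layer 2 stand-in: owns staging and lifetime, delegates math to Layer 1.
long long reduce_sum(const std::vector<int>& input) {
    OwningBuffer<int> buf(input.size());               // Layer 0: allocate (RAII)
    std::copy(input.begin(), input.end(), buf.data()); // stage data
    return reduce_sum_kernel(buf.data(), buf.size());  // Layer 1: compute
}
```

The point of the split is that Layer 1 can be tested and reused without any allocation policy baked in, while Layer 2 decides how memory is obtained and released.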
```
include/cuda/
├── memory/               # Layer 0: Memory Foundation
│   ├── buffer.h          # cuda::memory::Buffer<T>
│   ├── unique_ptr.h      # cuda::memory::unique_ptr<T>
│   ├── memory_pool.h     # MemoryPool for allocation
│   └── allocator.h       # Allocator concepts
├── device/               # Layer 1: Device Kernels
│   ├── reduce_kernels.h
│   ├── scan_kernels.h
│   └── device_utils.h    # CUDA_CHECK, warp_reduce
├── algo/                 # Layer 2: Algorithm Wrappers
│   ├── reduce.h
│   ├── scan.h
│   └── sort.h
└── api/                  # Layer 3: High-Level API
    ├── device_vector.h   # STL-style device container
    ├── stream.h          # Stream and Event wrappers
    └── config.h          # Algorithm configuration objects
```
```
include/
├── image/                # Image processing
│   ├── types.h
│   ├── brightness.h
│   ├── gaussian_blur.h
│   ├── sobel_edge.h
│   └── morphology.h
├── parallel/             # Parallel primitives
│   ├── scan.h
│   ├── sort.h
│   └── histogram.h
├── matrix/               # Matrix operations
│   ├── add.h
│   ├── mult.h
│   └── ops.h
└── convolution/          # Convolution
    └── conv2d.h
```
```
src/
├── memory/               # Layer 0 implementations
├── cuda/
│   ├── device/           # Layer 1 implementations
│   └── algo/             # Layer 2 implementations
├── image/
├── parallel/
├── matrix/
└── convolution/
```
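`memory_pool.h` supplies pooled allocation for Layer 0. The core idea, carving one large allocation into reusable fixed-size blocks so repeated allocations avoid the system (or device) allocator, can be sketched host-side. All names below are illustrative stand-ins, not the library's API; the real MemoryPool would carve device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative fixed-block pool: one backing allocation is split into
// equal blocks that are handed out and recycled via a free list.
class FixedBlockPool {
public:
    FixedBlockPool(std::size_t block_size, std::size_t block_count)
        : block_size_(block_size), storage_(block_size * block_count) {
        for (std::size_t i = 0; i < block_count; ++i)
            free_list_.push_back(storage_.data() + i * block_size);
    }
    void* allocate() {
        if (free_list_.empty()) return nullptr;  // pool exhausted
        void* p = free_list_.back();
        free_list_.pop_back();
        return p;
    }
    void deallocate(void* p) { free_list_.push_back(static_cast<std::byte*>(p)); }
    std::size_t block_size() const { return block_size_; }
    std::size_t available() const { return free_list_.size(); }
private:
    std::size_t block_size_;
    std::vector<std::byte> storage_;         // single backing allocation
    std::vector<std::byte*> free_list_;      // recycled blocks
};
```

Deallocation here is just a push onto the free list, which is what makes pooling cheap compared to repeated `cudaMalloc`/`cudaFree` round trips.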
| Layer | Namespace | Purpose | Dependencies |
|---|---|---|---|
| Layer 0 | cuda::memory | Memory allocation, RAII, pooling | CUDA runtime |
| Layer 1 | cuda::device | Pure device kernels | Layer 0 |
| Layer 2 | cuda::algo | Memory management, algorithms | Layers 0, 1 |
| Layer 3 | cuda::api | STL-style containers | Layers 0, 1, 2 |
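The pure device kernels of Layer 1 (see `warp_reduce` in `device_utils.h`) revolve around the classic halving tree reduction: each pass folds the upper half of the active range into the lower half, which is what a block of CUDA threads does in shared memory with a sync between passes. A host-side sketch of that access pattern (plain C++, not the actual kernel):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Halving tree reduction: log2(n) passes, each combining pairs at a
// shrinking stride. On the GPU, each pass is one synchronized step.
int tree_reduce_sum(std::vector<int> data) {
    // Pad to the next power of two with zeros so halving is exact.
    std::size_t n = 1;
    while (n < data.size()) n <<= 1;
    data.resize(n, 0);
    for (std::size_t stride = n / 2; stride > 0; stride >>= 1)
        for (std::size_t i = 0; i < stride; ++i)
            data[i] += data[i + stride];   // fold upper half into lower
    return data[0];                        // result accumulates at index 0
}
```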
```shell
git clone https://github.com/pplmx/nova.git
cd nova

# Default build
cmake -B build -DNOVA_ENABLE_NCCL=OFF
cmake --build build --parallel

# Ninja build
cmake -G Ninja -B build-ninja -DNOVA_ENABLE_NCCL=OFF
cmake --build build-ninja --parallel

# Run
./build/bin/nova    # or ./build-ninja/bin/nova

# Test
cd build-ninja
ctest -j16          # Parallel tests (GPU memory limited to 16)
```

```cpp
#include "cuda/memory/buffer.h"

// RAII memory management
cuda::memory::Buffer<int> buf(1024);
buf.copy_from(host_data.data(), 1024);

// Memory pool for efficiency
cuda::memory::MemoryPool pool({.block_size = 1 << 20});
auto buf2 = pool.allocate(1024);
```

```cpp
#include "cuda/algo/reduce.h"

// Use the layered API
int sum = cuda::algo::reduce_sum(d_input, N);
int max = cuda::algo::reduce_max(d_input, N);
```

```cpp
#include "cuda/api/device_vector.h"
#include "cuda/api/stream.h"
#include "cuda/api/config.h"

// DeviceVector - STL-style container
cuda::api::DeviceVector<int> d_vec(N);
d_vec.copy_from(input);
int sum = cuda::algo::reduce_sum(d_vec.data(), d_vec.size());

// Stream - RAII async operations
cuda::api::Stream stream;
stream.synchronize();

// Config - algorithm configuration
auto config = cuda::api::ReduceConfig::optimized_config();
```

| Module | Files | Description |
|---|---|---|
| cuda::memory | Buffer, unique_ptr, MemoryPool, Allocator | Memory management |
| cuda::device | device_utils, reduce_kernels | Pure CUDA kernels |
| cuda::algo | reduce wrappers, device_buffer | Algorithm orchestration |
| cuda::api | DeviceVector, Stream, Event, Config | High-level API |
| image | types, brightness, gaussian_blur, sobel, morphology | Image processing |
| parallel | scan, sort, histogram | Parallel primitives |
| matrix | add, mult, ops | Matrix operations |
| convolution | conv2d | 2D convolution |
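The Stream and Event wrappers in cuda::api follow the standard RAII handle pattern: create the handle in the constructor, destroy it in the destructor, allow moves, forbid copies. A host-side sketch using a hypothetical stand-in handle API (in the real wrappers these would be calls like `cudaStreamCreate`/`cudaStreamDestroy`):

```cpp
#include <cassert>

// Stand-in C-style handle API; g_live_handles only exists so we can
// observe that every create is matched by a destroy.
static int g_live_handles = 0;
int* handle_create()        { ++g_live_handles; return new int(0); }
void handle_destroy(int* h) { --g_live_handles; delete h; }

// RAII wrapper: move-only ownership of the handle, the same shape as
// an RAII Stream or Event wrapper.
class Handle {
public:
    Handle() : h_(handle_create()) {}
    ~Handle() { if (h_) handle_destroy(h_); }
    Handle(const Handle&) = delete;             // no double-destroy
    Handle& operator=(const Handle&) = delete;
    Handle(Handle&& other) noexcept : h_(other.h_) { other.h_ = nullptr; }
    int* get() const { return h_; }
private:
    int* h_;
};
```

Because the destructor always runs, the handle is released on every exit path, including exceptions, without any explicit cleanup call.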
505 tests across multiple test suites, 99%+ passing:

```shell
# Full test suite
ctest -j16

# Single test
./bin/nova-tests --gtest_filter="BufferTest.*"

# v1.4 specific tests
./bin/nova-tests --gtest_filter="*MpiContext*:*TopologyMap*:*MultiNodeContext*"
```

| Option | Default | Description |
|---|---|---|
| NOVA_ENABLE_NCCL | ON | Enable NCCL collectives (requires NCCL) |
| NOVA_ENABLE_MPI | OFF | Enable MPI multi-node support |
| NOVA_ENABLE_UNITY_BUILD | ON | Faster compilation via unity builds |
| CTEST_PARALLEL_LEVEL | NCPU | Test parallelism (capped at 16 for GPU memory) |
| Generator | Command | Speed |
|---|---|---|
| Ninja | cmake -G Ninja -B build | Fastest |
| Make | cmake -B build | Standard |
| Target | Description |
|---|---|
| `cmake --build <dir>` | Build project (use `--parallel` for multi-core) |
| `ctest -j<N>` | Run tests in parallel |
| `make clean` | Clean build artifacts |
- CUDA Toolkit 20+
- CMake 4.0+
- C++23 compatible compiler
- CUDA-capable GPU
- (Optional) NCCL 2.25+ for multi-GPU collectives
- (Optional) MPI 3.1+ for multi-node support
- (Optional) Ninja for faster builds
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
See CONTRIBUTING.md.