A production-ready CUDA parallel algorithms library with a four-layer architecture (Layers 0-3), supporting education, extensibility, and production use cases.
```
┌────────────────────────────────────────────────────────────┐
│ Layer 3: High-Level API (STL-style)                        │
│ cuda::reduce(), cuda::sort()                               │
└────────────────────────────────────────────────────────────┘
                              ▲
┌────────────────────────────────────────────────────────────┐
│ Layer 2: Algorithm Wrappers                                │
│ cuda::algo::reduce_sum(), memory management                │
└────────────────────────────────────────────────────────────┘
                              ▲
┌────────────────────────────────────────────────────────────┐
│ Layer 1: Device Kernels                                    │
│ Pure __global__ kernels, no memory allocation              │
└────────────────────────────────────────────────────────────┘
                              ▲
┌────────────────────────────────────────────────────────────┐
│ Layer 0: Memory Foundation                                 │
│ Buffer<T>, unique_ptr<T>, MemoryPool, Allocator concepts   │
└────────────────────────────────────────────────────────────┘
```
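The separation the diagram describes (pure compute below, memory-owning wrappers above) can be sketched in plain host C++. This is an illustrative stand-in, not the library's API: in the real library, Layer 0 types manage device memory and Layer 1 consists of `__global__` kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Layer 1 stand-in: a pure "kernel" that computes over raw memory and
// never allocates (the real version is a __global__ kernel).
long long reduce_sum_kernel(const int* data, std::size_t n) {
    long long acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc += data[i];
    return acc;
}

// Layer 0 stand-in: RAII ownership of a typed buffer (host memory here;
// the real Buffer<T> wraps device allocation and release).
template <typename T>
class OwningBuffer {
public:
    explicit OwningBuffer(std::size_t n) : data_(new T[n]), size_(n) {}
    T* data() { return data_.get(); }
    std::size_t size() const { return size_; }
private:
    std::unique_ptr<T[]> data_;
    std::size_t size_;
};

// Layer 2 stand-in: owns staging and lifetime, delegates math to Layer 1.
long long reduce_sum(const std::vector<int>& input) {
    OwningBuffer<int> buf(input.size());               // Layer 0: allocate (RAII)
    std::copy(input.begin(), input.end(), buf.data()); // stage data
    return reduce_sum_kernel(buf.data(), buf.size());  // Layer 1: compute
}
```

The point of the split is that Layer 1 can be tested and reused without any allocation policy baked in, while Layer 2 decides how memory is obtained and released.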
```
include/cuda/
├── memory/               # Layer 0: Memory Foundation
│   ├── buffer.h          # cuda::memory::Buffer<T>
│   ├── unique_ptr.h      # cuda::memory::unique_ptr<T>
│   ├── memory_pool.h     # MemoryPool for allocation
│   └── allocator.h       # Allocator concepts
├── device/               # Layer 1: Device Kernels
│   ├── reduce_kernels.h
│   ├── scan_kernels.h
│   └── device_utils.h    # CUDA_CHECK, warp_reduce
├── algo/                 # Layer 2: Algorithm Wrappers
│   ├── reduce.h
│   ├── scan.h
│   └── sort.h
└── api/                  # Layer 3: High-Level API
    ├── device_vector.h   # STL-style device container
    ├── stream.h          # Stream and Event wrappers
    └── config.h          # Algorithm configuration objects
```
```
include/
├── image/                # Image processing
│   ├── types.h
│   ├── brightness.h
│   ├── gaussian_blur.h
│   ├── sobel_edge.h
│   └── morphology.h
├── parallel/             # Parallel primitives
│   ├── scan.h
│   ├── sort.h
│   └── histogram.h
├── matrix/               # Matrix operations
│   ├── add.h
│   ├── mult.h
│   └── ops.h
└── convolution/          # Convolution
    └── conv2d.h
```
```
src/
├── memory/               # Layer 0 implementations
├── cuda/
│   ├── device/           # Layer 1 implementations
│   └── algo/             # Layer 2 implementations
├── image/
├── parallel/
├── matrix/
└── convolution/
```
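`memory_pool.h` supplies pooled allocation for Layer 0. The core idea, carving one large allocation into reusable fixed-size blocks so repeated allocations avoid the system (or device) allocator, can be sketched host-side. All names below are illustrative stand-ins, not the library's API; the real MemoryPool would carve device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative fixed-block pool: one backing allocation is split into
// equal blocks that are handed out and recycled via a free list.
class FixedBlockPool {
public:
    FixedBlockPool(std::size_t block_size, std::size_t block_count)
        : block_size_(block_size), storage_(block_size * block_count) {
        for (std::size_t i = 0; i < block_count; ++i)
            free_list_.push_back(storage_.data() + i * block_size);
    }
    void* allocate() {
        if (free_list_.empty()) return nullptr;  // pool exhausted
        void* p = free_list_.back();
        free_list_.pop_back();
        return p;
    }
    void deallocate(void* p) { free_list_.push_back(static_cast<std::byte*>(p)); }
    std::size_t block_size() const { return block_size_; }
    std::size_t available() const { return free_list_.size(); }
private:
    std::size_t block_size_;
    std::vector<std::byte> storage_;         // single backing allocation
    std::vector<std::byte*> free_list_;      // recycled blocks
};
```

Deallocation here is just a push onto the free list, which is what makes pooling cheap compared to repeated `cudaMalloc`/`cudaFree` round trips.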
| Layer | Namespace | Purpose | Dependencies |
|---|---|---|---|
| Layer 0 | cuda::memory | Memory allocation, RAII, pooling | CUDA runtime |
| Layer 1 | cuda::device | Pure device kernels | Layer 0 |
| Layer 2 | cuda::algo | Memory management, algorithms | Layers 0, 1 |
| Layer 3 | cuda::api | STL-style containers | Layers 0, 1, 2 |
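The pure device kernels of Layer 1 (see `warp_reduce` in `device_utils.h`) revolve around the classic halving tree reduction: each pass folds the upper half of the active range into the lower half, which is what a block of CUDA threads does in shared memory with a sync between passes. A host-side sketch of that access pattern (plain C++, not the actual kernel):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Halving tree reduction: log2(n) passes, each combining pairs at a
// shrinking stride. On the GPU, each pass is one synchronized step.
int tree_reduce_sum(std::vector<int> data) {
    // Pad to the next power of two with zeros so halving is exact.
    std::size_t n = 1;
    while (n < data.size()) n <<= 1;
    data.resize(n, 0);
    for (std::size_t stride = n / 2; stride > 0; stride >>= 1)
        for (std::size_t i = 0; i < stride; ++i)
            data[i] += data[i + stride];   // fold upper half into lower
    return data[0];                        // result accumulates at index 0
}
```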
```shell
git clone https://github.com/pplmx/nova.git
cd nova

# Default build
cmake -B build -DNOVA_ENABLE_NCCL=OFF
cmake --build build --parallel

# Ninja build
cmake -G Ninja -B build-ninja -DNOVA_ENABLE_NCCL=OFF
cmake --build build-ninja --parallel

# Run
./build/bin/nova    # or ./build-ninja/bin/nova

# Test
cd build-ninja
ctest -j16          # Parallel tests (GPU memory limited to 16)
```

```cpp
#include "cuda/memory/buffer.h"

// RAII memory management
cuda::memory::Buffer<int> buf(1024);
buf.copy_from(host_data.data(), 1024);

// Memory pool for efficiency
cuda::memory::MemoryPool pool({.block_size = 1 << 20});
auto buf2 = pool.allocate(1024);
```

```cpp
#include "cuda/algo/reduce.h"

// Use the layered API
int sum = cuda::algo::reduce_sum(d_input, N);
int max = cuda::algo::reduce_max(d_input, N);
```

```cpp
#include "cuda/api/device_vector.h"
#include "cuda/api/stream.h"
#include "cuda/api/config.h"

// DeviceVector - STL-style container
cuda::api::DeviceVector<int> d_vec(N);
d_vec.copy_from(input);
int sum = cuda::algo::reduce_sum(d_vec.data(), d_vec.size());

// Stream - RAII async operations
cuda::api::Stream stream;
stream.synchronize();

// Config - algorithm configuration
auto config = cuda::api::ReduceConfig::optimized_config();
```

| Module | Files | Description |
|---|---|---|
| cuda::memory | Buffer, unique_ptr, MemoryPool, Allocator | Memory management |
| cuda::device | device_utils, reduce_kernels | Pure CUDA kernels |
| cuda::algo | reduce wrappers, device_buffer | Algorithm orchestration |
| cuda::api | DeviceVector, Stream, Event, Config | High-level API |
| image | types, brightness, gaussian_blur, sobel, morphology | Image processing |
| parallel | scan, sort, histogram | Parallel primitives |
| matrix | add, mult, ops | Matrix operations |
| convolution | conv2d | 2D convolution |
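The Stream and Event wrappers in cuda::api follow the standard RAII handle pattern: create the handle in the constructor, destroy it in the destructor, allow moves, forbid copies. A host-side sketch using a hypothetical stand-in handle API (in the real wrappers these would be calls like `cudaStreamCreate`/`cudaStreamDestroy`):

```cpp
#include <cassert>

// Stand-in C-style handle API; g_live_handles only exists so we can
// observe that every create is matched by a destroy.
static int g_live_handles = 0;
int* handle_create()        { ++g_live_handles; return new int(0); }
void handle_destroy(int* h) { --g_live_handles; delete h; }

// RAII wrapper: move-only ownership of the handle, the same shape as
// an RAII Stream or Event wrapper.
class Handle {
public:
    Handle() : h_(handle_create()) {}
    ~Handle() { if (h_) handle_destroy(h_); }
    Handle(const Handle&) = delete;             // no double-destroy
    Handle& operator=(const Handle&) = delete;
    Handle(Handle&& other) noexcept : h_(other.h_) { other.h_ = nullptr; }
    int* get() const { return h_; }
private:
    int* h_;
};
```

Because the destructor always runs, the handle is released on every exit path, including exceptions, without any explicit cleanup call.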
505 tests across multiple test suites, 99%+ passing:

```shell
# Full test suite
ctest -j16

# Single test
./bin/nova-tests --gtest_filter="BufferTest.*"

# v1.4 specific tests
./bin/nova-tests --gtest_filter="*MpiContext*:*TopologyMap*:*MultiNodeContext*"
```

| Option | Default | Description |
|---|---|---|
| NOVA_ENABLE_NCCL | ON | Enable NCCL collectives (requires NCCL) |
| NOVA_ENABLE_MPI | OFF | Enable MPI multi-node support |
| NOVA_ENABLE_UNITY_BUILD | ON | Faster compilation via unity builds |
| CTEST_PARALLEL_LEVEL | NCPU | Test parallelism (capped at 16 for GPU memory) |
| Generator | Command | Speed |
|---|---|---|
| Ninja | cmake -G Ninja -B build | Fastest |
| Make | cmake -B build | Standard |
| Target | Description |
|---|---|
| `cmake --build <dir>` | Build project (use `--parallel` for multi-core) |
| `ctest -j<N>` | Run tests in parallel |
| `make clean` | Clean build artifacts |
- CUDA Toolkit 20+
- CMake 4.0+
- C++23 compatible compiler
- CUDA-capable GPU
- (Optional) NCCL 2.25+ for multi-GPU collectives
- (Optional) MPI 3.1+ for multi-node support
- (Optional) Ninja for faster builds
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
See CONTRIBUTING.md.