CUDA Parallel Algorithms Library


A production-ready CUDA parallel algorithms library built on a five-layer architecture, designed for education, extensibility, and production use.

Architecture

Five-Layer Design

```text
┌─────────────────────────────────────────────────────────────┐
│  Layer 3: High-Level API (STL-style)                        │
│  cuda::reduce(), cuda::sort()                               │
└─────────────────────────────────────────────────────────────┘
                              ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 2: Algorithm Wrappers                                │
│  cuda::algo::reduce_sum(), memory management                │
└─────────────────────────────────────────────────────────────┘
                              ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 1: Device Kernels                                    │
│  Pure __global__ kernels, no memory allocation              │
└─────────────────────────────────────────────────────────────┘
                              ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 0: Memory Foundation                                 │
│  Buffer<T>, unique_ptr<T>, MemoryPool, Allocator concepts   │
└─────────────────────────────────────────────────────────────┘
```

Directory Structure

```text
include/cuda/
├── memory/               # Layer 0: Memory Foundation
│   ├── buffer.h          # cuda::memory::Buffer<T>
│   ├── unique_ptr.h      # cuda::memory::unique_ptr<T>
│   ├── memory_pool.h     # MemoryPool for allocation
│   └── allocator.h       # Allocator concepts
├── device/               # Layer 1: Device Kernels
│   ├── reduce_kernels.h
│   ├── scan_kernels.h
│   └── device_utils.h    # CUDA_CHECK, warp_reduce
├── algo/                 # Layer 2: Algorithm Wrappers
│   ├── reduce.h
│   ├── scan.h
│   └── sort.h
└── api/                  # Layer 3: High-Level API
    ├── device_vector.h   # STL-style device container
    ├── stream.h          # Stream and Event wrappers
    └── config.h          # Algorithm configuration objects
```

```text
include/
├── image/               # Image processing
│   ├── types.h
│   ├── brightness.h
│   ├── gaussian_blur.h
│   ├── sobel_edge.h
│   └── morphology.h
├── parallel/            # Parallel primitives
│   ├── scan.h
│   ├── sort.h
│   └── histogram.h
├── matrix/              # Matrix operations
│   ├── add.h
│   ├── mult.h
│   └── ops.h
└── convolution/         # Convolution
    └── conv2d.h
```

```text
src/
├── memory/               # Layer 0 implementations
├── cuda/
│   ├── device/           # Layer 1 implementations
│   └── algo/             # Layer 2 implementations
├── image/
├── parallel/
├── matrix/
└── convolution/
```

Layer Responsibilities

| Layer | Namespace | Purpose | Dependencies |
| --- | --- | --- | --- |
| Layer 0 | `cuda::memory` | Memory allocation, RAII, pooling | CUDA runtime |
| Layer 1 | `cuda::device` | Pure device kernels | Layer 0 |
| Layer 2 | `cuda::algo` | Memory management, algorithms | Layers 0, 1 |
| Layer 3 | `cuda::api` | STL-style containers | Layers 0, 1, 2 |
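The dependency discipline in the table can be sketched with a host-side C++ analogy (the names below are hypothetical; the real layers live in `cuda::device` and `cuda::algo` and run on the GPU): Layer 1 is pure computation over buffers it is handed, while Layer 2 owns inputs and orchestrates the call.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Layer 1 analogy: pure compute over a raw buffer, no allocation --
// mirrors a __global__ kernel that only reads/writes memory it is handed.
int reduce_kernel(const int* data, std::size_t n) {
    return std::accumulate(data, data + n, 0);
}

// Layer 2 analogy: owns the container and orchestrates the "kernel" call --
// mirrors cuda::algo::reduce_sum(), which manages device memory and
// launches the Layer 1 kernel.
int reduce_sum(const std::vector<int>& input) {
    return reduce_kernel(input.data(), input.size());
}
```

Keeping Layer 1 allocation-free is what lets the same kernels be reused by both the Layer 2 wrappers and the Layer 3 containers.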

Quick Start

Build (Make)

```shell
git clone https://github.com/pplmx/nova.git
cd nova
cmake -B build -DNOVA_ENABLE_NCCL=OFF
cmake --build build --parallel
```

Build (Ninja - Faster)

```shell
cmake -G Ninja -B build-ninja -DNOVA_ENABLE_NCCL=OFF
cmake --build build-ninja --parallel
```

Run Demo

```shell
./build/bin/nova  # or ./build-ninja/bin/nova
```

Run Tests

```shell
cd build-ninja
ctest -j16        # run tests with up to 16 parallel jobs (capped to fit GPU memory)
```

Usage Examples

Layer 0: Memory Foundation

```cpp
#include "cuda/memory/buffer.h"
#include "cuda/memory/memory_pool.h"

// RAII memory management: device memory is freed when buf goes out of scope
cuda::memory::Buffer<int> buf(1024);
buf.copy_from(host_data.data(), 1024);  // host_data: e.g. a std::vector<int> of 1024 elements

// Memory pool for efficient repeated allocations
cuda::memory::MemoryPool pool({.block_size = 1 << 20});
auto buf2 = pool.allocate(1024);
```
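To make the RAII idea concrete, here is a simplified host-only sketch of the pattern `Buffer<T>` implements; the real class pairs `cudaMalloc`/`cudaFree` the same way this sketch pairs `new[]`/`delete[]` (this is an illustrative analogy, not the library's source).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Host-side analogy of cuda::memory::Buffer<T>: allocation in the
// constructor, release in the destructor, no copies allowed.
template <typename T>
class Buffer {
public:
    explicit Buffer(std::size_t n) : size_(n), data_(new T[n]) {}
    ~Buffer() { delete[] data_; }            // released even on early return/exception
    Buffer(const Buffer&) = delete;          // forbid copies: no accidental double-free
    Buffer& operator=(const Buffer&) = delete;

    void copy_from(const T* src, std::size_t n) { std::copy(src, src + n, data_); }
    T* data() { return data_; }
    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    T* data_;
};
```

Deleting the copy operations is the key design choice: a device allocation has a single owner, so moves (not copies) are the natural way to transfer it.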

Layer 2: Algorithm API

```cpp
#include "cuda/algo/reduce.h"

// Use layered API
int sum = cuda::algo::reduce_sum(d_input, N);
int max = cuda::algo::reduce_max(d_input, N);
```

Layer 3: High-Level API

```cpp
#include "cuda/api/device_vector.h"
#include "cuda/api/stream.h"
#include "cuda/api/config.h"

// DeviceVector - STL-style container
cuda::api::DeviceVector<int> d_vec(N);
d_vec.copy_from(input);
int sum = cuda::algo::reduce_sum(d_vec.data(), d_vec.size());

// Stream - RAII async operations
cuda::api::Stream stream;
stream.synchronize();

// Config - algorithm configuration
auto config = cuda::api::ReduceConfig::optimized_config();
```

Modules

| Module | Files | Description |
| --- | --- | --- |
| `cuda::memory` | Buffer, unique_ptr, MemoryPool, Allocator | Memory management |
| `cuda::device` | device_utils, reduce_kernels | Pure CUDA kernels |
| `cuda::algo` | reduce wrappers, device_buffer | Algorithm orchestration |
| `cuda::api` | DeviceVector, Stream, Event, Config | High-level API |
| `image` | types, brightness, gaussian_blur, sobel, morphology | Image processing |
| `parallel` | scan, sort, histogram | Parallel primitives |
| `matrix` | add, mult, ops | Matrix operations |
| `convolution` | conv2d | 2D convolution |
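As an illustration of what the convolution module computes, below is a naive host-side reference for 2D convolution with zero padding, the kind of baseline a GPU `conv2d` is typically validated against. This is a sketch (the zero-padding boundary policy and function name are assumptions), not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive zero-padded 2D convolution over a row-major h x w image with a
// kh x kw kernel. O(h*w*kh*kw); out-of-bounds taps contribute zero.
std::vector<float> conv2d_ref(const std::vector<float>& img, int h, int w,
                              const std::vector<float>& k, int kh, int kw) {
    std::vector<float> out(static_cast<std::size_t>(h) * w, 0.0f);
    const int ph = kh / 2, pw = kw / 2;  // half-kernel offsets (padding)
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < kh; ++ky) {
                for (int kx = 0; kx < kw; ++kx) {
                    const int iy = y + ky - ph, ix = x + kx - pw;
                    if (iy >= 0 && iy < h && ix >= 0 && ix < w)
                        acc += img[iy * w + ix] * k[ky * kw + kx];
                }
            }
            out[y * w + x] = acc;
        }
    }
    return out;
}
```

A GPU implementation parallelizes the outer two loops (one thread per output pixel); the reference above exists so the two can be compared element-wise in tests.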

Testing

505 tests across multiple test suites, 99%+ passing:

```shell
# Full test suite
ctest -j16

# Single test
./bin/nova-tests --gtest_filter="BufferTest.*"

# v1.4 specific tests
./bin/nova-tests --gtest_filter="*MpiContext*:*TopologyMap*:*MultiNodeContext*"
```

Test Configuration

| Option | Default | Description |
| --- | --- | --- |
| `NOVA_ENABLE_NCCL` | ON | Enable NCCL collectives (requires NCCL) |
| `NOVA_ENABLE_MPI` | OFF | Enable MPI multi-node support |
| `NOVA_ENABLE_UNITY_BUILD` | ON | Faster compilation via unity builds |
| `CTEST_PARALLEL_LEVEL` | number of CPU cores | Test parallelism (capped at 16 for GPU memory) |
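Combining these options, a multi-node test build might be configured as follows (flag names are from the table above; the build directory name is illustrative):

```shell
cmake -G Ninja -B build-ninja \
  -DNOVA_ENABLE_NCCL=ON \
  -DNOVA_ENABLE_MPI=ON \
  -DNOVA_ENABLE_UNITY_BUILD=ON
cmake --build build-ninja --parallel
CTEST_PARALLEL_LEVEL=16 ctest --test-dir build-ninja
```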

Development

Build Options

| Generator | Command | Speed |
| --- | --- | --- |
| Ninja | `cmake -G Ninja -B build` | Fastest |
| Make | `cmake -B build` | Standard |

Build Targets

| Command | Description |
| --- | --- |
| `cmake --build <dir>` | Build the project (use `--parallel` for multi-core) |
| `ctest -j<N>` | Run tests in parallel |
| `make clean` | Clean build artifacts |

Requirements

  • CUDA Toolkit 12.0+
  • CMake 4.0+
  • C++23 compatible compiler
  • CUDA-capable GPU
  • (Optional) NCCL 2.25+ for multi-GPU collectives
  • (Optional) MPI 3.1+ for multi-node support
  • (Optional) Ninja for faster builds

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

See CONTRIBUTING.md.
