Add Augmented Core Extraction Algorithm by julianmi · Pull Request #1404 · rapidsai/cuvs

julianmi · 2025-10-02T15:42:30Z

This PR introduces Augmented Core Extraction (ACE), an approach proposed by @anaruse for building CAGRA indices on very large datasets that exceed GPU memory capacity. ACE enables users to build high-quality approximate nearest neighbor search indices on datasets that would otherwise be impossible to process on a single GPU. The approach uses the host memory if large enough and falls back to the disk if required.

This work is a collaboration: @anaruse, @tfeher, @achirkin, @mfoerste4

Algorithm Description

1. Dataset Partitioning: The dataset is partitioned using balanced k-means clustering on sampled data. Each vector is assigned to its two closest partition centroids (primary and augmented). The primary partitions are non-overlapping. The augmentation ensures that cross-partition edges are captured in the final graph. Partitions smaller than a minimum threshold are automatically merged with larger partitions to ensure computational efficiency and graph quality. Vectors from small partitions are reassigned to the nearest valid partitions.
2. Per-Partition Graph Building: For each partition, a sub-index is built independently (regular build_knn_graph() flow) from its primary vectors plus the augmented vectors from neighboring partitions.
3. Graph Combining: The per-partition graphs are combined into a single unified CAGRA index. Merging is not needed since the primary partitions are non-overlapping. The in-memory variant remaps the local partition IDs to global dataset IDs to create a correct index. The disk variant stores the backward index mappings (dataset_mapping.bin), the reordered dataset (reordered_dataset.bin) and the optimized CAGRA graph (cagra_graph.bin) on disk. The index is then incomplete, as shown by cuvs::neighbors::index::on_disk(). The files are stored in cuvs::neighbors::index::file_directory().

The HNSW index serialization was provided by @mfoerste4 in [WIP] Add disk2disk serialization foe ACE Algorithm #1410, which was merged here. It adds the serialize_to_hnsw() serialization routine, which combines the dataset, graph, and mapping. The data is combined on the fly while being streamed from disk to disk, trying to minimize the required host memory. The host still needs enough memory to hold the index.
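
The primary/augmented assignment from the partitioning step can be illustrated with a small host-only sketch. This is not the cuVS implementation (which uses balanced k-means on sampled data); it assumes the centroids are already computed and only shows how each vector receives its closest and second-closest partition. The names `PartitionLabels` and `assign_two_closest` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper type for illustration; not part of cuVS.
struct PartitionLabels {
  std::vector<std::uint32_t> primary;    // closest centroid per vector
  std::vector<std::uint32_t> augmented;  // second-closest centroid per vector
};

// Assign every row of `data` (n_rows x dim, row-major) to its two closest
// centroids (n_parts x dim, row-major) using squared L2 distance.
// Assumes at least two centroids.
inline PartitionLabels assign_two_closest(const std::vector<float>& data,
                                          const std::vector<float>& centroids,
                                          std::size_t dim)
{
  const std::size_t n_rows  = data.size() / dim;
  const std::size_t n_parts = centroids.size() / dim;
  PartitionLabels labels{std::vector<std::uint32_t>(n_rows),
                         std::vector<std::uint32_t>(n_rows)};
  for (std::size_t i = 0; i < n_rows; ++i) {
    float best   = std::numeric_limits<float>::max();
    float second = best;
    std::uint32_t best_id = 0, second_id = 0;
    for (std::size_t c = 0; c < n_parts; ++c) {
      float d = 0.0f;
      for (std::size_t j = 0; j < dim; ++j) {
        const float diff = data[i * dim + j] - centroids[c * dim + j];
        d += diff * diff;
      }
      if (d < best) {
        // New closest centroid; the previous best becomes second best.
        second    = best;
        second_id = best_id;
        best      = d;
        best_id   = static_cast<std::uint32_t>(c);
      } else if (d < second) {
        second    = d;
        second_id = static_cast<std::uint32_t>(c);
      }
    }
    labels.primary[i]   = best_id;
    labels.augmented[i] = second_id;
  }
  return labels;
}
```

Because every vector has exactly one primary label, the primary partitions do not overlap; the augmented label is what places a copy of the vector into a neighboring partition's sub-index.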

Core Components

- ace_build(): Main routine which users should call.
- ace_get_partition_labels(): Performs balanced k-means clustering to assign each vector to its two closest partitions while handling small partition merging.
- ace_create_forward_and_backward_lists(): Creates bidirectional ID mappings between original dataset indices and reordered partition-local indices.
- ace_set_index_params(): Sets the index parameters based on the partition and augmented dataset to ensure efficient KNN graph building.
- ace_gather_partition_dataset(): In-memory only: Gathers the partition and augmented dataset.
- ace_adjust_sub_graph_ids: In-memory only: Adjusts IDs in the sub search graph and stores them in the main search graph.
- ace_adjust_final_graph_ids: In-memory only: Maps graph neighbor IDs from reordered space back to original vector IDs.
- ace_reorder_and_store_dataset: Disk only: Reorders the dataset based on partitions and stores it to disk. Uses write buffers to improve performance.
- ace_load_partition_dataset_from_disk: Disk only: Loads the partition dataset and augmented dataset from disk.
- file_descriptor and ace_read_large_file() / ace_write_large_file(): RAII file handle and chunked file I/O operations.
- CAGRA index changes: Added an on_disk_ flag and file_directory_ to the CAGRA index structure to support disk-backed indices.
- CAGRA parameter changes: Added ace_npartitions and ace_build_dir to the CAGRA parameters so users can specify that ACE should be used and which directory to use if required.
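
As a rough illustration of the forward and backward ID mappings and of the final ID adjustment listed above, the following host-only sketch reorders vectors by primary partition and translates neighbor IDs from the reordered space back to original IDs. The struct and function names are hypothetical and the data layout is an assumption; the actual routines in this PR work on cuVS data structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for illustration; not the cuVS data layout.
struct IdMaps {
  std::vector<std::uint32_t> forward;   // original ID  -> reordered ID
  std::vector<std::uint32_t> backward;  // reordered ID -> original ID
  std::vector<std::uint32_t> offsets;   // start of each partition in reordered space
};

// Build the mappings with a counting sort over the primary partition labels,
// so that vectors of the same partition become contiguous.
inline IdMaps make_id_maps(const std::vector<std::uint32_t>& primary, std::uint32_t n_parts)
{
  IdMaps maps;
  maps.forward.resize(primary.size());
  maps.backward.resize(primary.size());
  maps.offsets.assign(n_parts + 1, 0);
  // Count vectors per partition, then turn the counts into start offsets.
  for (std::uint32_t p : primary) { ++maps.offsets[p + 1]; }
  for (std::uint32_t p = 0; p < n_parts; ++p) { maps.offsets[p + 1] += maps.offsets[p]; }
  // Place each vector at the next free slot of its partition.
  std::vector<std::uint32_t> cursor(maps.offsets.begin(), maps.offsets.end() - 1);
  for (std::size_t i = 0; i < primary.size(); ++i) {
    const std::uint32_t pos = cursor[primary[i]]++;
    maps.forward[i]         = pos;
    maps.backward[pos]      = static_cast<std::uint32_t>(i);
  }
  return maps;
}

// Translate neighbor IDs stored in reordered space back to original IDs.
// Only the neighbor entries are rewritten; row order is left untouched.
inline void to_original_ids(std::vector<std::uint32_t>& neighbors, const IdMaps& maps)
{
  for (std::uint32_t& id : neighbors) { id = maps.backward[id]; }
}
```

In this sketch the backward list plays the role that dataset_mapping.bin has in the disk variant: it is the only information needed to get from a reordered ID back to the original vector.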

Usage

C++ API

```cpp
#include <cuvs/neighbors/cagra.hpp>

using namespace cuvs::neighbors;

// Configure index parameters
cagra::index_params params;
params.ace_npartitions = 10;  // Number of partitions (unset or <= 1 to disable ACE)
params.ace_build_dir = "/tmp/ace_build";  // Directory for intermediate files (should be a fast NVMe)
params.graph_degree = 64;
params.intermediate_graph_degree = 128;

// Build ACE index (dataset can be on host memory)
auto dataset = raft::make_host_matrix<float, int64_t>(n_rows, n_cols);
// ... load dataset ...

auto index = cagra::build_ace(res, params, dataset.view(), params.ace_npartitions);

// Search works identically to standard CAGRA if the host has enough memory (index.on_disk() == false)
cagra::search_params search_params;
auto neighbors = raft::make_device_matrix<uint32_t>(res, n_queries, k);
auto distances = raft::make_device_matrix<float>(res, n_queries, k);
cagra::search(res, search_params, index, queries, neighbors.view(), distances.view());
```

Storage Requirements

cagra_graph.bin: n_rows * graph_degree * sizeof(IdxT)
dataset_mapping.bin: n_rows * sizeof(IdxT)
reordered_dataset.bin: Size of the input dataset
augmented_dataset.bin: Size of the input dataset
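
To get a feel for the numbers, the formulas above can be evaluated for a concrete configuration. The helper below is a hypothetical sketch, not part of cuVS; it takes the element size and `sizeof(IdxT)` as inputs and simply applies the four formulas.

```cpp
#include <cstdint>

// Hypothetical helper; all sizes are in bytes and follow the formulas listed above.
struct AceDiskFootprint {
  std::uint64_t cagra_graph;
  std::uint64_t dataset_mapping;
  std::uint64_t reordered_dataset;
  std::uint64_t augmented_dataset;

  std::uint64_t total() const
  {
    return cagra_graph + dataset_mapping + reordered_dataset + augmented_dataset;
  }
};

inline AceDiskFootprint ace_disk_footprint(std::uint64_t n_rows,
                                           std::uint64_t n_cols,
                                           std::uint64_t graph_degree,
                                           std::uint64_t elem_size,  // e.g. sizeof(float)
                                           std::uint64_t idx_size)   // sizeof(IdxT)
{
  // Both dataset files are listed as "size of the input dataset".
  const std::uint64_t dataset_bytes = n_rows * n_cols * elem_size;
  return AceDiskFootprint{
    n_rows * graph_degree * idx_size, n_rows * idx_size, dataset_bytes, dataset_bytes};
}
```

As an invented example (not a measurement from this PR), 100 million 128-dimensional float vectors with graph_degree = 64 and a 4-byte IdxT give 25.6 GB for the graph, 0.4 GB for the mapping, and 51.2 GB for each of the two dataset files, or about 128.4 GB if all four files are on disk at the same time.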

copy-pr-bot · 2025-10-02T15:42:46Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

copy-pr-bot · 2025-10-02T15:44:36Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

tfeher · 2025-10-06T08:31:50Z

/ok to test 38c03a4

- Introduced dynamic configuration of nprobes and nlists for IVF-PQ based on partition size to improve KNN graph construction.
- Added logging for both IVF-PQ and NN-Descent parameters to provide better insights during the graph building process.
- Ensured default parameters are set when no specific graph build parameters are provided.

- Updated CAGRA and HNSW index implementations to utilize file descriptors for managing disk-backed indices.
- Removed the `cuvsCagraIndexGetFileDirectory` function and related documentation, as it is no longer needed.
- Introduced `cuvsHnswIndexIsOnDisk` function to check if the HNSW index is stored on disk.
- Enhanced error handling for searching on disk-stored indices in both CAGRA and HNSW.

achirkin

Thanks for pushing to implement my file-descriptor suggestions! Only now I've noticed that converting from Linux file descriptors to streams is not as convenient as it should be. I have some suggestions for this below.

tfeher

Thanks @julianmi for addressing the issues. LGTM.

julianmi · 2025-11-13T13:23:12Z

> Thanks for pushing to implement my file-descriptor suggestions! Only now I've noticed that converting from Linux file descriptors to streams is not as convenient as it should be. I have some suggestions for this below.

Thank you for these suggestions. I have applied them and removed the on_disk() helper as you suggested.

examples/cpp/src/cagra_hnsw_ace_example.cu

The issues brought up by Artem were addressed. Since Artem is on leave, I am dismissing his requests for changes.

tfeher · 2025-11-13T21:53:27Z

/merge

@anaruse

Authors:
- Julian Miller (https://github.com/julianmi)
- Anupam (https://github.com/aamijar)
- Tarang Jain (https://github.com/tarang-jain)
- Malte Förster (https://github.com/mfoerste4)
- Jake Awe (https://github.com/AyodeAwe)
- Bradley Dice (https://github.com/bdice)
- Artem M. Chirkin (https://github.com/achirkin)
- Jinsol Park (https://github.com/jinsolp)

Approvers:
- MithunR (https://github.com/mythrocks)
- Robert Maynard (https://github.com/robertmaynard)
- Tamas Bela Feher (https://github.com/tfeher)
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1404

@mythrocks

Changes introduced with #1404 were not formatted correctly. Running spotless (via build.sh) fixed the formatting.

CC @mythrocks @cjnolet

Authors:
- Lorenzo Dematté (https://github.com/ldematte)
- Jake Awe (https://github.com/AyodeAwe)
- gpuCI (https://github.com/GPUtester)
- MithunR (https://github.com/mythrocks)

Approvers:
- MithunR (https://github.com/mythrocks)

URL: #1539

julianmi requested a review from a team as a code owner October 2, 2025 15:42

github-project-automation bot added this to Vector Search, ML, & Data Mining Release Board Oct 2, 2025

github-project-automation bot moved this to Todo in Vector Search, ML, & Data Mining Release Board Oct 2, 2025

julianmi marked this pull request as draft October 2, 2025 15:42

julianmi changed the base branch from branch-25.10 to branch-25.12 October 2, 2025 15:43

julianmi force-pushed the ace-disk branch from 5a756d4 to 62b6bea Compare October 2, 2025 15:44

tfeher added the feature request and non-breaking labels Oct 6, 2025

tfeher self-requested a review October 6, 2025 08:36

mfoerste4 mentioned this pull request Oct 6, 2025

[WIP] Add disk2disk serialization foe ACE Algorithm #1410

Closed

julianmi added 17 commits October 6, 2025 14:48

Integrate @anaruse's ACE method for large graphs

5e11ce6

- Adds the out-of-tree ACE method of @anaruse. This assumes graphs smaller than host memory.
- Adds `disk_enabled` and `graph_build_dir` parameters to select ACE method.

ACE: Clarify partition naming

2d64ff3

- Use partitions instead of clusters in ACE to distinguish between ACE clusters and regular KNN graph building clusters.

ACE: Implement merging of small partitions

4a779a4

- Added logic to identify and merge small partitions that do not meet the minimum size requirement for stable KNN graph construction.

ACE: Update parameters to clarify ace method usage

9f5d31c

- Replaced `disk_enabled` and `graph_build_dir` with `ace_npartitions` and `ace_build_dir` in the parameter parsing logic.
- Updated function signatures and documentation to clarify the new partitioning approach for ACE builds.

ACE: Add timinings

5552873

ACE: Remove unused vector_fwd_list_1 in build_ace

68a0ad8

ACE: Check if we have enough host memory

0e86d18

ACE: Restructure parameter setting

12b0366

ACE: Restructure small partition merging

b0f1d04

ACE: Refactor partition data gathering

e25375f

ACE: Refactor forward backward list creation

434bb4d

ACE: Refactor id adjusting of sub search graph

3b1010f

ACE: Refactor id adjusting of final search graph

31431ba

ACE: Refactor partition label handling and dataset storage

1c53df2

- Introduced new functions for reordering and storing datasets on disk, optimizing for NVMe performance.
- Clarified namings.

ACE: Improve file I/O speeds

128031b

ACE: Reduce logging

8efde31

julianmi added 2 commits November 10, 2025 14:22

Merge branch 'main' into ace-disk

b845653

achirkin reviewed Nov 10, 2025

cpp/include/cuvs/neighbors/cagra.hpp

julianmi added 3 commits November 10, 2025 16:03

ACE: Remove ace_set_index_params

faaa492

- From_hnsw_params sets the build method already

Merge branch 'main' into ace-disk

48d7f2a

Merge branch 'main' into ace-disk

ec097d1

tfeher requested changes Nov 11, 2025

cpp/include/cuvs/neighbors/cagra.hpp

achirkin previously requested changes Nov 12, 2025

julianmi and others added 9 commits November 12, 2025 17:02

ACE: Move file_io and host_memory headers

b3b5ddb

ACE: Add checks for indices on disk

5e7bd00

ACE: Minor improvements

756d8fb

- Refactor ACE dataset handling
- Improve buffer flushing
- Add file existence checks
- Enhance throughput calculations

ACE: Use use_disk_mode instead of use_disk

e4c440a

- use_disk is the user provided parameter. This might be false but use_disk_mode can be set to true in case of memory pressure.

ACE: Align graph degree in Java test

235710c

Merge branch 'main' into ace-disk

d4f7f42

ACE: Improve file descriptor

4df5f6e

- Add stream buffer support for the file descriptor

Co-authored-by: Artem M. Chirkin <9253178+achirkin@users.noreply.github.com>

ACE: Remove on_disk()

105b208

Merge branch 'main' into ace-disk

d2eb731

tfeher approved these changes Nov 13, 2025

ACE: Move helpers into their own compilation unit

4461637

tfeher reviewed Nov 13, 2025

examples/cpp/src/cagra_hnsw_ace_example.cu

ACE: Use disk-mode in example

e290130

rapids-bot bot merged commit 27bb22e into rapidsai:main Nov 13, 2025
160 of 164 checks passed

github-project-automation bot moved this from Todo to Done in Vector Search, ML, & Data Mining Release Board Nov 13, 2025

ldematte mentioned this pull request Nov 14, 2025

[Java] Fix format with spotless #1539

Merged

tfeher mentioned this pull request Jan 22, 2026

[FEA] Improve CAGRA-HNSW index conversion #762

Closed

Add Augmented Core Extraction Algorithm #1404
rapids-bot[bot] merged 142 commits into rapidsai:main from julianmi:ace-disk

12 participants
