Add Augmented Core Extraction Algorithm #1404
rapids-bot[bot] merged 142 commits into rapidsai:main
/ok to test 38c03a4
- Adds the out-of-tree ACE method of @anaruse. This assumes graphs smaller than host memory.
- Adds `disk_enabled` and `graph_build_dir` parameters to select ACE method.
- Use partitions instead of clusters in ACE to distinguish between ACE clusters and regular KNN graph building clusters.
- Introduced dynamic configuration of nprobes and nlists for IVF-PQ based on partition size to improve KNN graph construction.
- Added logging for both IVF-PQ and NN-Descent parameters to provide better insights during the graph building process.
- Ensured default parameters are set when no specific graph build parameters are provided.
- Added logic to identify and merge small partitions that do not meet the minimum size requirement for stable KNN graph construction.
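
The small-partition merge described in this commit can be sketched as follows. This is a minimal host-side illustration only, not the code in this PR: the function name `merge_small_partitions`, the nested-vector data layout, and the use of squared L2 distance are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sketch, not the cuVS implementation.
// labels[i] is the primary partition of vector i; centroids[p] is the centroid of partition p.
// Vectors whose partition has fewer than min_size members are moved to the nearest
// partition that meets the threshold. Returns the number of reassigned vectors.
inline std::size_t merge_small_partitions(std::vector<std::size_t>& labels,
                                          const std::vector<std::vector<float>>& data,
                                          const std::vector<std::vector<float>>& centroids,
                                          std::size_t min_size)
{
  assert(labels.size() == data.size());

  // Partition sizes are taken from the original labels and decide which partitions are valid.
  std::vector<std::size_t> sizes(centroids.size(), 0);
  for (std::size_t label : labels) { ++sizes[label]; }

  std::size_t moved = 0;
  for (std::size_t i = 0; i < labels.size(); ++i) {
    if (sizes[labels[i]] >= min_size) { continue; }  // partition is large enough
    float best_d     = std::numeric_limits<float>::max();
    std::size_t best = labels[i];  // unchanged if no valid partition exists
    for (std::size_t p = 0; p < centroids.size(); ++p) {
      if (sizes[p] < min_size) { continue; }  // skip partitions that are too small themselves
      float d = 0.0f;
      for (std::size_t k = 0; k < data[i].size(); ++k) {
        const float diff = data[i][k] - centroids[p][k];
        d += diff * diff;
      }
      if (d < best_d) {
        best_d = d;
        best   = p;
      }
    }
    if (best != labels[i]) {
      labels[i] = best;
      ++moved;
    }
  }
  return moved;
}
```

Validity is decided once from the original partition sizes, so a vector is never moved into a partition that was itself below the threshold.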
- Replaced `disk_enabled` and `graph_build_dir` with `ace_npartitions` and `ace_build_dir` in the parameter parsing logic.
- Updated function signatures and documentation to clarify the new partitioning approach for ACE builds.
- Introduced new functions for reordering and storing datasets on disk, optimizing for NVMe performance.
- Clarified naming.
- Updated CAGRA and HNSW index implementations to utilize file descriptors for managing disk-backed indices.
- Removed the `cuvsCagraIndexGetFileDirectory` function and related documentation, as it is no longer needed.
- Introduced `cuvsHnswIndexIsOnDisk` function to check if the HNSW index is stored on disk.
- Enhanced error handling for searching on disk-stored indices in both CAGRA and HNSW.
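
As a rough illustration of the file-descriptor approach (an RAII file handle combined with chunked I/O, as listed for `file_descriptor` and `ace_write_large_file()` under Core Components), here is a minimal POSIX sketch. The names `fd_guard` and `write_in_chunks` and the 1 GiB default chunk size are hypothetical and are not the helpers added by this PR.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <utility>

// Hypothetical RAII wrapper: owns a POSIX file descriptor and closes it on destruction.
class fd_guard {
 public:
  explicit fd_guard(int fd) : fd_(fd)
  {
    if (fd_ < 0) { throw std::runtime_error("invalid file descriptor"); }
  }
  fd_guard(const fd_guard&)            = delete;
  fd_guard& operator=(const fd_guard&) = delete;
  fd_guard(fd_guard&& other) noexcept : fd_(std::exchange(other.fd_, -1)) {}
  fd_guard& operator=(fd_guard&&) = delete;
  ~fd_guard()
  {
    if (fd_ >= 0) { ::close(fd_); }
  }
  int get() const noexcept { return fd_; }

 private:
  int fd_;
};

// Writes total_bytes from data in pieces of at most chunk_bytes, handling short writes.
// A single write() call may transfer fewer bytes than requested, hence the loop.
inline void write_in_chunks(int fd,
                            const void* data,
                            std::size_t total_bytes,
                            std::size_t chunk_bytes = std::size_t{1} << 30)
{
  assert(chunk_bytes > 0);
  const char* ptr     = static_cast<const char*>(data);
  std::size_t written = 0;
  while (written < total_bytes) {
    const std::size_t to_write = std::min(chunk_bytes, total_bytes - written);
    const ssize_t n            = ::write(fd, ptr + written, to_write);
    if (n < 0 && errno == EINTR) { continue; }  // interrupted by a signal: retry
    if (n <= 0) { throw std::runtime_error("write failed"); }
    written += static_cast<std::size_t>(n);
  }
}
```

Because the descriptor is closed in the destructor, an exception thrown halfway through a large write does not leak the handle.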
- From_hnsw_params sets the build method already
achirkin left a comment
Thanks for pushing to implement my file-descriptor suggestions! Only now have I noticed that converting from Linux file descriptors to streams is not as convenient as it should be. I have some suggestions for this below.
- Refactor ACE dataset handling
- Improve buffer flushing
- Add file existence checks
- Enhance throughput calculations
- `use_disk` is the user-provided parameter. It may be false, but `use_disk_mode` can still be set to true in case of memory pressure.
- Add stream buffer support for the file descriptor

Co-authored-by: Artem M. Chirkin <9253178+achirkin@users.noreply.github.com>
Thank you for these suggestions. I have applied them and removed the |
The issues brought up by Artem were addressed. Since Artem is on leave, I am dismissing his requests for changes.
/merge
This PR introduces **Augmented Core Extraction (ACE)**, an approach proposed by @anaruse for building CAGRA indices on very large datasets that exceed GPU memory capacity. ACE enables users to build high-quality approximate nearest neighbor search indices on datasets that would otherwise be impossible to process on a single GPU. The approach uses the host memory if large enough and falls back to the disk if required.

This work is a collaboration: @anaruse, @tfeher, @achirkin, @mfoerste4

## Algorithm Description

1. **Dataset Partitioning**: The dataset is partitioned using balanced k-means clustering on sampled data. Each vector is assigned to its two closest partition centroids (primary and augmented). The primary partitions are non-overlapping. The augmentation ensures that cross-partition edges are captured in the final graph. Partitions smaller than a minimum threshold are automatically merged with larger partitions to ensure computational efficiency and graph quality. Vectors from small partitions are reassigned to the nearest valid partitions.
2. **Per-Partition Graph Building**: For each partition, a sub-index is built independently (regular `build_knn_graph()` flow) with its primary vectors plus augmented vectors from neighboring partitions.
3. **Graph Combining**: The per-partition graphs are combined into a single unified CAGRA index. Merging is not needed since the primary partitions are non-overlapping. The in-memory variant remaps the local partition IDs to global dataset IDs to create a correct index. The disk variant stores the backward index mappings (`dataset_mapping.bin`), the reordered dataset (`reordered_dataset.bin`) and the optimized CAGRA graph (`cagra_graph.bin`) on disk. The index is then incomplete, as shown by `cuvs::neighbors::index::on_disk()`. The files are stored in `cuvs::neighbors::index::file_directory()`.

The HNSW index serialization was provided by @mfoerste4 in rapidsai#1410, which was merged here.
This adds the `serialize_to_hnsw()` serialization routine that combines the dataset, graph, and mapping. The data is combined on the fly while being streamed from disk to disk, which minimizes the required host memory. The host still needs enough memory to hold the index.

## Core Components

- **`ace_build()`**: Main routine which users should call.
- **`ace_get_partition_labels()`**: Performs balanced k-means clustering to assign each vector to its two closest partitions while handling small partition merging.
- **`ace_create_forward_and_backward_lists()`**: Creates bidirectional ID mappings between original dataset indices and reordered partition-local indices.
- **`ace_set_index_params()`**: Sets the index parameters based on the partition and augmented dataset to ensure efficient KNN graph building.
- **`ace_gather_partition_dataset()`**: In-memory only: Gather the partition and augmented dataset.
- **`ace_adjust_sub_graph_ids`**: In-memory only: Adjust IDs in the sub search graph and store them in the main search graph.
- **`ace_adjust_final_graph_ids`**: In-memory only: Map graph neighbor IDs from reordered space back to original vector IDs.
- **`ace_reorder_and_store_dataset`**: Disk only: Reorder the dataset based on partitions and store it to disk. Uses write buffers to improve performance.
- **`ace_load_partition_dataset_from_disk`**: Disk only: Load the partition dataset and augmented dataset from disk.
- **`file_descriptor` and `ace_read_large_file()` / `ace_write_large_file()`**: RAII file handle and chunked file I/O operations.
- **CAGRA index changes**: Added `on_disk_` flag and `file_directory_` to the CAGRA index structure to support disk-backed indices.
- **CAGRA parameter changes**: Added `ace_npartitions` and `ace_build_dir` to the CAGRA parameters for users to specify that ACE should be used and which directory should be used if required.
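
To make the partition assignment and the ID mappings more concrete, here is a minimal host-side sketch of the two ideas behind `ace_get_partition_labels()` and `ace_create_forward_and_backward_lists()`. It is an illustration under assumptions, not the implementation in this PR: the names `two_closest_partitions`, `id_mappings`, and `build_id_mappings` are hypothetical, distances are plain squared L2, and the reordered layout is assumed to store the primary vectors of each partition contiguously.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sketch, not the cuVS implementation.
// Returns {primary, augmented}: the indices of the closest and second-closest
// centroid by squared L2 distance. Requires at least two centroids.
inline std::pair<std::size_t, std::size_t> two_closest_partitions(
  const std::vector<float>& vec, const std::vector<std::vector<float>>& centroids)
{
  assert(centroids.size() >= 2);
  std::size_t best = 0, second = 0;
  float best_d   = std::numeric_limits<float>::max();
  float second_d = std::numeric_limits<float>::max();
  for (std::size_t c = 0; c < centroids.size(); ++c) {
    float d = 0.0f;
    for (std::size_t k = 0; k < vec.size(); ++k) {
      const float diff = vec[k] - centroids[c][k];
      d += diff * diff;
    }
    if (d < best_d) {  // new closest: the previous closest becomes second-closest
      second   = best;
      second_d = best_d;
      best     = c;
      best_d   = d;
    } else if (d < second_d) {
      second   = c;
      second_d = d;
    }
  }
  return {best, second};
}

struct id_mappings {
  std::vector<std::size_t> backward;  // reordered position -> original ID
  std::vector<std::size_t> forward;   // original ID -> reordered position
  std::vector<std::size_t> offsets;   // start of each partition in the reordered layout
};

// Builds both mappings from the primary labels with a counting sort over partitions.
inline id_mappings build_id_mappings(const std::vector<std::size_t>& primary,
                                     std::size_t n_partitions)
{
  id_mappings m;
  m.offsets.assign(n_partitions + 1, 0);
  for (std::size_t p : primary) {
    assert(p < n_partitions);
    ++m.offsets[p + 1];
  }
  for (std::size_t p = 0; p < n_partitions; ++p) { m.offsets[p + 1] += m.offsets[p]; }

  m.backward.resize(primary.size());
  m.forward.resize(primary.size());
  std::vector<std::size_t> cursor(m.offsets.begin(), m.offsets.end() - 1);
  for (std::size_t id = 0; id < primary.size(); ++id) {
    const std::size_t pos = cursor[primary[id]]++;
    m.backward[pos]       = id;
    m.forward[id]         = pos;
  }
  return m;
}
```

In this sketch, translating a neighbor ID from the reordered space back to the original dataset is a single lookup, `backward[pos]`. That lookup corresponds to what the description attributes to `ace_adjust_final_graph_ids` and to the backward mapping stored in `dataset_mapping.bin`.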
## Usage

### C++ API

```cpp
#include <cuvs/neighbors/cagra.hpp>

using namespace cuvs::neighbors;

// `res`, `n_rows`, `n_cols`, `n_queries`, `k`, and `queries` are assumed to be defined by the caller.

// Configure index parameters
cagra::index_params params;
params.ace_npartitions = 10;              // Number of partitions (unset or <= 1 to disable ACE)
params.ace_build_dir   = "/tmp/ace_build";  // Directory for intermediate files (should be a fast NVMe)
params.graph_degree    = 64;
params.intermediate_graph_degree = 128;

// Build ACE index (dataset can be on host memory)
auto dataset = raft::make_host_matrix<float, int64_t>(n_rows, n_cols);
// ... load dataset ...
auto index = cagra::build_ace(res, params, dataset.view(), params.ace_npartitions);

// Search works identically to standard CAGRA if the host has enough memory (index.on_disk() == false)
cagra::search_params search_params;
auto neighbors = raft::make_device_matrix<uint32_t>(res, n_queries, k);
auto distances = raft::make_device_matrix<float>(res, n_queries, k);
cagra::search(res, search_params, index, queries, neighbors.view(), distances.view());
```

### Storage Requirements

1. `cagra_graph.bin`: `n_rows * graph_degree * sizeof(IdxT)`
2. `dataset_mapping.bin`: `n_rows * sizeof(IdxT)`
3. `reordered_dataset.bin`: Size of the input dataset
4. `augmented_dataset.bin`: Size of the input dataset

Authors:
- Julian Miller (https://github.com/julianmi)
- Anupam (https://github.com/aamijar)
- Tarang Jain (https://github.com/tarang-jain)
- Malte Förster (https://github.com/mfoerste4)
- Jake Awe (https://github.com/AyodeAwe)
- Bradley Dice (https://github.com/bdice)
- Artem M. Chirkin (https://github.com/achirkin)
- Jinsol Park (https://github.com/jinsolp)

Approvers:
- MithunR (https://github.com/mythrocks)
- Robert Maynard (https://github.com/robertmaynard)
- Tamas Bela Feher (https://github.com/tfeher)
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1404
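
The per-file sizes under Storage Requirements can be turned into a quick capacity check before choosing `ace_build_dir`. The sketch below is only a back-of-the-envelope helper: `ace_disk_footprint` and `estimate_ace_disk_footprint` are hypothetical names, and it assumes the input dataset is a dense `n_rows x n_cols` matrix, so that "size of the input dataset" equals `n_rows * n_cols * sizeof(T)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of cuVS: bytes needed per file for the disk variant.
struct ace_disk_footprint {
  std::uint64_t graph_bytes;      // cagra_graph.bin
  std::uint64_t mapping_bytes;    // dataset_mapping.bin
  std::uint64_t reordered_bytes;  // reordered_dataset.bin
  std::uint64_t augmented_bytes;  // augmented_dataset.bin
  std::uint64_t total() const
  {
    return graph_bytes + mapping_bytes + reordered_bytes + augmented_bytes;
  }
};

// elem_bytes is sizeof(T) of the dataset elements; idx_bytes is sizeof(IdxT).
inline ace_disk_footprint estimate_ace_disk_footprint(std::uint64_t n_rows,
                                                      std::uint64_t n_cols,
                                                      std::uint64_t graph_degree,
                                                      std::uint64_t elem_bytes,
                                                      std::uint64_t idx_bytes)
{
  assert(elem_bytes > 0 && idx_bytes > 0);
  const std::uint64_t dataset_bytes = n_rows * n_cols * elem_bytes;  // assumes a dense matrix
  return ace_disk_footprint{n_rows * graph_degree * idx_bytes,
                            n_rows * idx_bytes,
                            dataset_bytes,
                            dataset_bytes};
}
```

For example, under these assumptions `estimate_ace_disk_footprint(1000, 128, 64, sizeof(float), sizeof(std::uint32_t))` yields 256,000 bytes for the graph, 4,000 for the mapping, and 512,000 for each dataset file, for a total of 1,284,000 bytes.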
Changes introduced with #1404 were not formatted correctly. Running spotless (via build.sh) fixed the formatting.

CC @mythrocks @cjnolet

Authors:
- Lorenzo Dematté (https://github.com/ldematte)
- Jake Awe (https://github.com/AyodeAwe)
- gpuCI (https://github.com/GPUtester)
- MithunR (https://github.com/mythrocks)

Approvers:
- MithunR (https://github.com/mythrocks)

URL: #1539