Primary index

This page describes the purpose of the primary index.

What is the primary index?

Aerospike Database automatically creates a primary index for each namespace to index its data, with a metadata entry for every record in the namespace. This allows Aerospike to provide consistent, low-latency access to any record in the database, regardless of the number of records or the total size of the data.

Index metadata

Each primary index metadata entry consumes 64 bytes of storage. In Aerospike Enterprise Edition (EE), the primary index storage is configurable: in memory (DRAM), in persistent memory (PMem), or on an NVMe SSD. By default, the primary index lives in memory.

In addition to a 20-byte record digest, record metadata includes the following (sketched in the example after this list):

  • The generation count increments each time the record is written; used for resolving conflicting updates.
  • The void-time tracks when a key expires; used by the expiration and eviction subsystem.
  • The Last Update Time (LUT) tracks when the record was last written; used for conflict resolution during cold restart and migration (depending on your configuration settings), to filter records using expressions, for incremental backup scans, for truncate commands, for cross-datacenter replication (XDR), and more.
  • Replication state information; used for strong-consistency namespaces.
  • The exact location of the record in the data storage device; this supports retrieving the record with a single read IO.
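
To make the fields above concrete, here is a minimal illustrative sketch in Python of what a primary index entry carries. The class name, field names, and types are assumptions for illustration only; they do not reflect the server's actual 64-byte in-memory layout.

```python
from dataclasses import dataclass

@dataclass
class PrimaryIndexEntry:
    """Illustrative sketch of per-record index metadata (not the real layout)."""
    digest: bytes           # 20-byte record digest identifying the record
    generation: int         # increments on every write; resolves conflicting updates
    void_time: int          # expiration time used by the expiration/eviction subsystem
    last_update_time: int   # LUT: when the record was last written
    replication_state: int  # state used for strong-consistency namespaces (hypothetical encoding)
    storage_location: int   # device/offset info so the record can be read with one I/O
```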

Index structure

Aerospike’s primary index blends a distributed hash table with a distributed tree structure on each server. The entire keyspace of the namespace is divided into partitions using a robust hash function. A total of 4096 partitions are evenly distributed across the cluster nodes. The replication-factor namespace configuration parameter determines how many copies are kept in replica partitions, which never reside on the same node as the master partition. See data-distribution for details on hashing and partitioning.
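
As a rough sketch of how a record's key maps to one of the 4096 partitions, the example below hashes the key to a 20-byte digest and derives a partition ID from a few of its bits. SHA-1 stands in here for the server's actual digest function, and the exact bits used are an assumption; the point is only to illustrate the even, hash-based spread of keys across partitions.

```python
import hashlib

NUM_PARTITIONS = 4096  # fixed number of partitions in a namespace

def partition_id(set_name: str, user_key: bytes) -> int:
    """Map a key to a partition via a 20-byte digest (SHA-1 as a stand-in)."""
    digest = hashlib.sha1(set_name.encode() + user_key).digest()  # 20 bytes
    # 12 bits are enough to address 2**12 == 4096 partitions.
    return int.from_bytes(digest[:2], "little") % NUM_PARTITIONS

print(partition_id("demo", b"user-42"))  # a value in [0, 4095]
```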

When a cluster node fails, the indexes on the other nodes, where the replica partitions live, are immediately available. If the failed node remains down, data is rebalanced through migration, and indexes are built for the newly arrived partitions on each node.

Sprigs

Aerospike uses a red-black tree structure called a sprig. You can configure the number of sprigs for each partition. Configuring the right number of sprigs is a trade-off between extra space overhead and optimized parallel access.

At the lowest level, Aerospike uses a self-balancing, in-memory red-black tree structure per partition. The number of levels grows logarithmically with the number of records.
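
To calibrate that logarithmic growth, the quick back-of-the-envelope calculation below shows roughly how deep a balanced binary tree gets as the record count per partition grows; the figures are approximations, not server measurements.

```python
import math

# Approximate depth of a balanced binary search tree holding N keys.
for n in (1_000, 1_000_000, 100_000_000):
    print(n, "records ->", math.ceil(math.log2(n)), "levels")
# 1_000 records       -> 10 levels
# 1_000_000 records   -> 20 levels (the regime discussed below)
# 100_000_000 records -> 27 levels
```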

There are two distinct ways this can impact performance:

  • Tree traversal - On a node holding a large number of records (>1 million per partition) under high throughput (>1 million transactions per second), index tree traversal could become the bottleneck. If the tree is 20 or more levels deep, the tree search latency can become most of the transaction’s latency.

  • Lock contention - On a node with even modest throughput, if there is a large number of records, traversing an index tree can take a very long time, over one second. Reads and overwrites do not contend with the reduce lock, but creates and deletes do. Even at a small create/delete throughput, a one-second blockage of all creates and deletes on a single partition can quickly tie up all (or most) transaction threads, holding up other transactions, even ones that would not themselves have been blocked. This can be exacerbated when an NSUP cycle generates many deletes (expirations/evictions) while scans or migrations, which also require the partition reduce lock, are in progress.

A general answer to both of these problems is to divide a partition’s index tree into multiple sub-trees, or sprigs. This reduces tree depth and size, shortening search and traversal time, and it allows multiple tree-and-lock pairs per partition, with each lock applying to one or more sub-trees, which reduces lock contention.
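
The arithmetic below sketches how splitting a partition's tree into sprigs shrinks each sub-tree, using an illustrative sprig count; the configured value and the server's exact lock granularity may differ.

```python
import math

records_per_partition = 1_000_000
sprigs_per_partition = 256            # illustrative; the sprig count is configurable

records_per_sprig = records_per_partition // sprigs_per_partition
print(math.ceil(math.log2(records_per_partition)))  # ~20 levels with a single tree
print(math.ceil(math.log2(records_per_sprig)))      # ~12 levels per sprig

# Locks can be spread across sprigs as well, so a create or delete holding one
# sprig's lock no longer blocks operations on the partition's other sprigs.
```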

Index persistence

The primary index is derived from the data itself and can be rebuilt from that data, depending on the fast restart (also known as warmstart) configuration setting. Fast restart enables upgrades with minimal downtime in Aerospike EE variants.

Fast restart allocates index memory from a shared memory segment (shmem). For planned shutdowns and restarts, such as during an upgrade, the server re-attaches to the shared memory segment on restart and activates the primary indexes without scanning the data on storage.
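
The following is a purely conceptual sketch of the re-attach idea, using Python's multiprocessing.shared_memory rather than Aerospike's own shmem handling; the segment name and contents are hypothetical and unrelated to the server's actual index layout.

```python
from multiprocessing import shared_memory

# "First run": create a named shared memory segment and write into it.
seg = shared_memory.SharedMemory(name="demo_index_segment", create=True, size=64)
seg.buf[:5] = b"index"
seg.close()  # the process detaches, but the named segment persists in the OS

# "Later run": re-attach by name instead of rebuilding the contents.
seg2 = shared_memory.SharedMemory(name="demo_index_segment")
print(bytes(seg2.buf[:5]))  # b'index'
seg2.close()
seg2.unlink()  # clean up the segment when it is no longer needed
```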

Index storage

Where the server stores a primary index is determined by the index-type configuration parameter. The following options are available:

Type  | Description
shmem | Linux shared memory.
flash | A block storage device, typically NVMe SSD.
pmem  | Persistent memory, such as Intel Optane DC Persistent Memory.

For more information about primary index storage methods, see Configure the primary index.
