A GPU-Accelerated Binary Vector Index
Published 2024-06-12
This is a vector index I built that supports insertion and k-nearest neighbors (k-NN) querying, optimized for GPUs. It operates entirely in CUDA and can process queries on half a billion vectors in under 200 milliseconds. The codebase is structured as a standalone library with an HTTP API for remote access. It’s intended for high-performance search tasks—think similarity search, AI model retrieval, or reinforcement learning replay buffers. The codebase is located at https://github.com/rodlaf/BinaryGPUIndex.
How It Works
At its core, this is a GPU-native data structure for storing and retrieving vectors efficiently. The current implementation supports 64-bit binary vectors and uses Hamming distance for similarity, but it can be extended to support floating-point vectors with cosine similarity.
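With 64-bit binary vectors, Hamming distance reduces to an XOR followed by a population count. A minimal CPU sketch (on the GPU this maps to the `__popcll()` intrinsic; here I use the GCC/Clang builtin):

```cpp
#include <cstdint>

// Hamming distance between two 64-bit binary vectors: XOR the words so that
// differing bit positions become 1s, then count the set bits.
// __builtin_popcountll is the GCC/Clang popcount builtin; in CUDA kernels
// the equivalent is __popcll().
inline int hammingDistance(uint64_t a, uint64_t b) {
    return __builtin_popcountll(a ^ b);
}
```

Because each comparison is a single XOR plus popcount, a GPU can evaluate distances to hundreds of millions of stored vectors with trivially parallel work per thread.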
Insertion
Vectors are inserted in batches, each tagged with a UUID. The insertion process updates both the GPU memory and an on-disk index, allowing for persistence across restarts. The system pre-allocates memory to avoid frequent reallocation overhead.
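The bookkeeping looks roughly like the sketch below. The names are mine, not the library's: a fixed-capacity arena stands in for the preallocated GPU buffer, and a UUID-to-slot map mirrors what the on-disk index records for persistence.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch of batched insertion into a preallocated store.
struct BinaryIndex {
    // Reserve capacity up front to avoid reallocation during inserts,
    // analogous to preallocating the GPU buffer.
    explicit BinaryIndex(size_t capacity) { vectors.reserve(capacity); }

    // Insert a batch of (uuid, vector) pairs; returns how many were stored.
    size_t insertBatch(const std::vector<std::pair<std::string, uint64_t>>& batch) {
        for (const auto& [id, v] : batch) {
            slotOf[id] = vectors.size();  // record which slot this UUID occupies
            vectors.push_back(v);         // append into the preallocated arena
        }
        return batch.size();
    }

    std::vector<uint64_t> vectors;                   // stands in for GPU memory
    std::unordered_map<std::string, size_t> slotOf;  // UUID -> slot (persisted on disk)
};
```

In the real system the vector payload lives in device memory and the UUID map is what gets written to disk, so the index can be reopened after a restart without re-inserting anything.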
Querying
The k-NN search is implemented with a custom radix selection algorithm that avoids full sorting. Instead of sorting all distances, it efficiently selects only the k-smallest elements, cutting down unnecessary computation.
API Design
I exposed the index as a simple HTTP server using Crow (a lightweight C++ web framework). The server has two endpoints:
POST /insert: Inserts a batch of vectors.
POST /query: Returns the k-nearest neighbors to a given vector.
The payloads are JSON-encoded, making the system easy to integrate into any workflow.
Example Insert Request
{
"vectors": [
{"id": "4d1027ec-80b7-4df3-b950-ae824fadbd61", "values": "100000101111...011011"},
{"id": "e78241cc-5bc6-4532-8b7c-76809c2704bd", "values": "110010110000...110010"}
]
}
Example Query Request
{
"topK": 1000,
"vector": "110011111110...0100"
}
Response:
{
"matches": [
{"id": "8ea44221-707e-4b26-815a-90bb60339401", "distance": 0.12598, "values": "110011111110...0100"},
{"id": "770d9f87-7a81-484d-95f9-5ca3321a6028", "distance": 0.12598, "values": "110011010110...0100"}
]
}
Performance Benchmarks
I tested the system on an NVIDIA A10G GPU with 500 million vectors. Here’s what I found:
| Operation | Time (ms) |
| --- | --- |
| Open index (500M vectors) | 195,572 |
| Query (k=1000) | 193 |
| Insert (batch of 1000) | 114 |
The index loads in ~3 minutes, but once it’s in memory, query times stay under 200ms for large k.
Future Work
- Better insert handling: Right now, inserting duplicate IDs just appends them again. A proper upsert mechanism would help.
- Support for larger vector types: Floating-point vectors are doable but require a different distance metric, such as cosine similarity.
- Distributed indexing: Scaling across multiple GPUs would enable even larger datasets.
Conclusion
This project was an experiment in building high-speed, GPU-accelerated retrieval systems. It runs fast, scales well, and integrates cleanly via an API. The next step is deciding whether to optimize further or generalize it for other vector types.