A GPU-Accelerated Binary Vector Index

Published 2024-06-12


This is a GPU-accelerated vector index I built that supports insertion and k-nearest-neighbor (k-NN) querying. It runs entirely in CUDA and can answer queries over half a billion vectors in under 200 milliseconds. The codebase is structured as a standalone library with an HTTP API for remote access, and it's aimed at high-performance search tasks: similarity search, retrieval for AI models, or reinforcement learning replay buffers. The code lives at https://github.com/rodlaf/BinaryGPUIndex.

How It Works

At its core, this is a GPU-native data structure for storing and retrieving vectors efficiently. The current implementation supports 64-bit binary vectors and uses Hamming distance for similarity, but it can be extended to support floating-point vectors with cosine similarity.
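
On 64-bit binary vectors, Hamming distance is just an XOR followed by a population count, which CUDA exposes as the __popcll intrinsic. Here is a minimal kernel sketch; the names are illustrative rather than the repo's actual ones:

#include <cstdint>

// Hamming distance between one query and every stored 64-bit vector:
// XOR exposes the differing bits and __popcll counts them.
__global__ void hammingDistances(const uint64_t *db, uint64_t query,
                                 uint32_t *distances, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        distances[i] = __popcll(db[i] ^ query);  // result is in [0, 64]
}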

Insertion

Vectors are inserted in batches, each tagged with a UUID. Every insertion updates both GPU memory and an on-disk copy of the index, so data persists across restarts. Device memory is pre-allocated up front to avoid the overhead of frequent reallocation.
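
A rough sketch of that insert path, assuming an append-only layout; the struct and function names here are hypothetical, and the UUID bookkeeping is omitted:

#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

struct DeviceIndex {
    uint64_t *d_vectors;  // pre-allocated device buffer
    size_t    count;      // vectors currently stored
    size_t    capacity;   // slots reserved up front
    FILE     *diskLog;    // append-only on-disk copy for persistence
};

// Append a batch of 64-bit vectors to GPU memory and to the on-disk
// log. Returns false when the pre-allocated buffer is full.
bool insertBatch(DeviceIndex &idx, const uint64_t *h_batch, size_t n) {
    if (idx.count + n > idx.capacity) return false;
    cudaMemcpy(idx.d_vectors + idx.count, h_batch,
               n * sizeof(uint64_t), cudaMemcpyHostToDevice);
    fwrite(h_batch, sizeof(uint64_t), n, idx.diskLog);  // persist
    idx.count += n;
    return true;
}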

Querying

The k-NN search is implemented with a custom radix selection algorithm that avoids a full sort. Instead of ordering all of the distances, it selects only the k smallest, replacing an O(n log n) sort with a few linear counting passes.
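
To see why selection is enough, note that the Hamming distance between two 64-bit vectors is an integer in [0, 64]. The sketch below is a single-pass simplification of a multi-digit radix select (the names are hypothetical): one histogram pins down the k-th-smallest distance, and a second pass can then gather every candidate at or below that threshold.

#include <cstdint>

// Bin each distance into one of 65 buckets; the Hamming distance
// between 64-bit vectors can never exceed 64.
__global__ void distanceHistogram(const uint32_t *distances, size_t n,
                                  unsigned int *hist) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&hist[distances[i]], 1u);
}

// Host side: walk the histogram to find the smallest threshold t such
// that at least k vectors lie at distance <= t. No sorting happens.
int kthDistanceThreshold(const unsigned int hist[65], size_t k) {
    size_t cumulative = 0;
    for (int d = 0; d <= 64; ++d) {
        cumulative += hist[d];
        if (cumulative >= k) return d;
    }
    return 64;
}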

API Design

I exposed the index as a simple HTTP server using Crow (a lightweight C++ web framework). The server has two endpoints:

  • POST /insert: Inserts a batch of vectors.
  • POST /query: Returns the k-nearest neighbors to a given vector.

The payloads are JSON-encoded, so any language with an HTTP client can drive the index.

Example Insert Request

{
    "vectors": [
        {"id": "4d1027ec-80b7-4df3-b950-ae824fadbd61", "values": "100000101111...011011"},
        {"id": "e78241cc-5bc6-4532-8b7c-76809c2704bd", "values": "110010110000...110010"}
    ]
}

Example Query Request

{
    "topK": 1000,
    "vector": "110011111110...0100"
}

Response:

{
    "matches": [
        {"id": "8ea44221-707e-4b26-815a-90bb60339401", "distance": 0.12598, "values": "110011111110...0100"},
        {"id": "770d9f87-7a81-484d-95f9-5ca3321a6028", "distance": 0.12598, "values": "110011010110...0100"}
    ]
}
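
For a sense of how thin the HTTP layer is, here is a hedged sketch of the two routes in Crow (CrowCpp); the handler bodies are stubbed out, and the port and error handling are assumptions rather than the repo's exact code:

#include "crow.h"
#include <utility>

int main() {
    crow::SimpleApp app;

    // POST /insert: parse {"vectors": [{"id": ..., "values": ...}]}
    // and hand the batch to the GPU index.
    CROW_ROUTE(app, "/insert").methods(crow::HTTPMethod::Post)
    ([](const crow::request &req) {
        auto body = crow::json::load(req.body);
        if (!body) return crow::response(400);
        // ... decode each bit string and insert into the index ...
        return crow::response(200);
    });

    // POST /query: parse {"topK": ..., "vector": ...}, run the k-NN
    // search, and return the matches as JSON.
    CROW_ROUTE(app, "/query").methods(crow::HTTPMethod::Post)
    ([](const crow::request &req) {
        auto body = crow::json::load(req.body);
        if (!body) return crow::response(400);
        crow::json::wvalue result;
        // ... fill result["matches"] from the k-NN search ...
        return crow::response(std::move(result));
    });

    app.port(8080).multithreaded().run();
}

Because the server is only a thin wrapper, the index also remains usable directly as a C++ library.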

Performance Benchmarks

I tested the system on an NVIDIA A10G GPU with 500 million vectors. Here’s what I found:

Operation                   Time (ms)
Open index (500M vectors)     195,572
Query (k=1000)                    193
Insert (batch of 1000)            114

The index takes about three minutes to load, but once it's in memory, queries stay under 200 ms even for large k.

Future Work

  • Better insert handling: Right now, inserting a vector whose ID already exists simply appends a duplicate entry. A proper upsert mechanism would replace the old value instead.
  • Support for larger vector types: Floating-point vectors are feasible but require a different distance metric, such as cosine similarity.
  • Distributed indexing: Scaling across multiple GPUs would enable even larger datasets.

Conclusion

This project was an experiment in building high-speed, GPU-accelerated retrieval systems. It runs fast, scales well, and integrates cleanly via its HTTP API. The next step is deciding whether to optimize further or generalize it to other vector types.