Vector Databases

Introduction

A vector database management system (VDMS) is a database that stores vectors: mathematical representations of data in a high-dimensional space. In addition to the vector data, a VDMS typically stores a vector identifier, relevant attributes, and sometimes the raw data the vector represents. A VDMS implements nearest-neighbor search, returning the vectors closest to a given query vector.
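To make the record layout and query model concrete, here is a minimal in-memory sketch (the `ToyVectorStore` class and its fields are hypothetical, for illustration only): each record holds an identifier, a vector, and optional attributes, and a query returns the nearest stored vectors.

```python
import numpy as np

# Toy in-memory vector store (hypothetical structure, not a real VDMS).
class ToyVectorStore:
    def __init__(self, dim):
        self.dim = dim
        self.ids = []         # vector identifiers
        self.vectors = []     # the vectors themselves
        self.attributes = []  # relevant attributes per vector

    def add(self, vec_id, vector, attributes=None):
        assert len(vector) == self.dim
        self.ids.append(vec_id)
        self.vectors.append(np.asarray(vector, dtype=np.float32))
        self.attributes.append(attributes or {})

    def nearest(self, query, k=1):
        # Brute-force nearest-neighbor search by Euclidean distance.
        matrix = np.stack(self.vectors)
        dists = np.linalg.norm(matrix - np.asarray(query, dtype=np.float32), axis=1)
        order = np.argsort(dists)[:k]
        return [(self.ids[i], float(dists[i])) for i in order]

store = ToyVectorStore(dim=3)
store.add("a", [1.0, 0.0, 0.0], {"label": "x-axis"})
store.add("b", [0.0, 1.0, 0.0], {"label": "y-axis"})
print(store.nearest([0.9, 0.1, 0.0], k=1))  # "a" is the closest vector
```

Real systems replace the brute-force scan with the index structures discussed later, but the interface (add a vector with an id and attributes, query for the k nearest) is the same.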

Use Cases

One of the most common use cases for a VDMS is reducing hallucinations in LLMs. LLMs are trained on public, possibly biased datasets, and those datasets go stale because the outside world is dynamic. To compensate, a VDMS is used to augment user prompts with retrieved context. This process is called Retrieval-Augmented Generation (RAG): the model consults a trusted external source (e.g. a dense vector index of Wikipedia) before generating an output. Alternatively, a vector index can cache answers to similar past prompts (a “semantic cache”). A semantic cache saves on expensive API calls and avoids bandwidth limits, since serving a lookup from a VDMS requires no GPUs or TPUs.
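A semantic cache can be sketched in a few lines. This is a hedged toy version: `embed` stands in for a real embedding model (here it is just a character-frequency vector), and the "LLM answer" string stands in for a real API call.

```python
import numpy as np

def embed(text):
    # Toy embedding: normalized character-frequency vector (illustration only;
    # a real system would use an embedding model here).
    vec = np.zeros(26, dtype=np.float32)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

cache = []  # list of (prompt_embedding, cached_answer)

def answer(prompt, threshold=0.95):
    q = embed(prompt)
    for emb, cached in cache:
        if float(np.dot(q, emb)) >= threshold:  # cosine-similarity cache hit
            return cached                        # reuse, skipping the API call
    result = f"LLM answer for: {prompt}"  # placeholder for an expensive API call
    cache.append((q, result))
    return result

first = answer("What is a vector database?")
second = answer("What is a vector database?")  # repeat prompt hits the cache
```

Only a vector comparison is needed to serve a cache hit, which is why no GPU or TPU is involved on the lookup path.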

Non-LLM applications of a VDMS include video search and image search. Unstructured data, such as an image, is first encoded as a D-dimensional vector: features are extracted by passing the image through a convolutional neural network. Likewise, for text, Word2vec is a family of neural-network models that map natural-language words to vectors.
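One common trick for turning a piece of text into a single D-dimensional vector is to average its per-word vectors, in the spirit of Word2vec. The tiny vocabulary below is made up for illustration; a real system would load pretrained word vectors.

```python
import numpy as np

# Hypothetical 2-dimensional word vectors (a real vocabulary would be
# pretrained and have hundreds of dimensions).
word_vectors = {
    "cat": np.array([0.9, 0.1], dtype=np.float32),
    "dog": np.array([0.8, 0.2], dtype=np.float32),
    "car": np.array([0.1, 0.9], dtype=np.float32),
}

def sentence_vector(sentence):
    # Average the vectors of known words to get one fixed-size vector.
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2, dtype=np.float32)

v = sentence_vector("cat dog")  # averages the two animal vectors
```

Note that the two animal words land near each other in this space while "car" sits far away, which is exactly the property nearest-neighbor search exploits.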

VDMS Internals

Similar to other databases, a VDMS has a query processor and a storage engine. The query processor removes redundant parts of a query and then finds the most efficient way to execute it. The query itself is executed by the storage engine, which consists of components such as a transaction manager, a lock manager, and the underlying storage structure. The design of these components depends on the nature of the application: an LLM application is read-heavy, whereas an e-commerce application is write-heavy, and each needs a different index design.

Compared to a traditional database, designing a VDMS poses a few challenges. First, with structured data, users can provide boolean predicates (e.g. “SELECT * FROM customers WHERE id = 1234”). Vector queries instead rely on semantic similarity, which is harder to state exactly, so a VDMS uses a similarity score (i.e. a distance function between two vectors). Examples include Jaccard similarity (the number of shared elements over the number of distinct elements), Euclidean distance, and the vector dot product. Second, comparisons are expensive: a single similarity comparison takes O(D) time, where D is the vector dimension. Third, vectors have larger storage requirements and can span multiple data pages. Finally, unlike structured data, vectors have no natural sort order, so it is harder to apply existing index design principles to them.
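The three similarity measures named above are each a one-liner:

```python
import numpy as np

def jaccard(a, b):
    # Shared elements over distinct elements (defined on sets, not raw vectors).
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def euclidean(u, v):
    # Straight-line distance between two vectors: O(D) in the dimension D.
    return float(np.linalg.norm(np.asarray(u) - np.asarray(v)))

def dot(u, v):
    # Vector dot product; larger means more similar (also O(D)).
    return float(np.dot(u, v))

print(jaccard([1, 2, 3], [2, 3, 4]))  # 2 shared / 4 distinct = 0.5
```

Note each measure touches every one of the D components once, which is where the O(D) per-comparison cost comes from.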

Most VDMS support “nearest neighbor” queries (e.g. find the k closest vectors). Less common are predicated search queries (e.g. find the k closest vectors such that price < 100). Finding the exact k closest vectors costs O(DN) time, where D is the vector dimension and N is the number of vectors. An approximate nearest-neighbor search, however, can run in sub-linear time.

There are three types of index structures that can be used in a VDMS. The first is the table: these structures divide the vector space into buckets of similar vectors. Table-based indexes can use randomization (e.g. hash functions designed so that similar inputs collide) or learned partitions (e.g. k-means clustering). The second type is the tree, a nesting of tables in which the vector space is recursively split by distance. The final structure is the graph, overlaid on top of the vector space: each node is a vector, and edges connect nearby vectors, weighted by the distance between them. Hierarchical Navigable Small World (HNSW) is a graph index that performs approximate nearest neighbor (ANN) searches in many vector databases.
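A table-based index with learned partitions can be sketched as follows (an IVF-style layout under simplifying assumptions: two hand-picked centroids stand in for ones learned by k-means). Each vector is assigned to the bucket of its nearest centroid, and a query scans only its own bucket, trading exactness for speed.

```python
import numpy as np

# Two fixed centroids standing in for k-means cluster centers (assumption).
centroids = np.array([[0.0, 0.0], [10.0, 10.0]], dtype=np.float32)
buckets = {0: [], 1: []}  # centroid index -> list of (id, vector)

def assign(vec):
    # Index of the nearest centroid.
    return int(np.argmin(np.linalg.norm(centroids - vec, axis=1)))

def add(vec_id, vec):
    vec = np.asarray(vec, dtype=np.float32)
    buckets[assign(vec)].append((vec_id, vec))

def search(query, k=1):
    query = np.asarray(query, dtype=np.float32)
    # Scan only one bucket instead of all N vectors: approximate, but fast.
    candidates = sorted(buckets[assign(query)],
                        key=lambda item: float(np.linalg.norm(item[1] - query)))
    return [vec_id for vec_id, _ in candidates[:k]]

add("near-origin", [1.0, 1.0])
add("far", [9.0, 9.0])
print(search([0.5, 0.5], k=1))  # ['near-origin']
```

The approximation error comes from vectors near a bucket boundary: the true nearest neighbor may sit in an unscanned bucket, which is why production systems typically probe several nearby buckets.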

Predicated queries require query optimization, and a VDMS can take one of two approaches. One option is to first filter out the vectors that don’t match the predicate condition, then perform an index scan on the remaining vectors. Alternatively, a normal index scan runs first, and each candidate vector is then checked against the predicate condition.
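Both strategies can be shown side by side on a toy record set (the record layout and function names are illustrative). The query asks for the closest vector with price < 100; a brute-force scan stands in for the index scan.

```python
import numpy as np

# (id, vector, attributes) records, as stored by the VDMS.
records = [
    ("a", np.array([0.0, 0.0], dtype=np.float32), {"price": 250}),
    ("b", np.array([1.0, 1.0], dtype=np.float32), {"price": 50}),
    ("c", np.array([5.0, 5.0], dtype=np.float32), {"price": 80}),
]

def pre_filter_search(query, predicate, k=1):
    # Option 1: drop non-matching vectors first, then scan the remainder.
    survivors = [(i, v) for i, v, attrs in records if predicate(attrs)]
    survivors.sort(key=lambda item: float(np.linalg.norm(item[1] - query)))
    return [i for i, _ in survivors[:k]]

def post_filter_search(query, predicate, k=1):
    # Option 2: scan in distance order, checking the predicate on each
    # candidate until k matches are found.
    ordered = sorted(records, key=lambda r: float(np.linalg.norm(r[1] - query)))
    return [i for i, _, attrs in ordered if predicate(attrs)][:k]

query = np.array([0.0, 0.0], dtype=np.float32)
cheap = lambda attrs: attrs["price"] < 100
print(pre_filter_search(query, cheap))   # ['b']
print(post_filter_search(query, cheap))  # ['b']
```

Here vector "a" is closest but too expensive; both strategies correctly skip it and return "b". Which strategy wins in practice depends on the predicate's selectivity.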

To save space locally, vectors can also be quantized. Quantization passes the vector through an (often lossy) compression function that reduces its size, for example by lowering the precision of each component or reducing the dimensionality. With quantization, a VDMS saves storage space and speeds up queries at the cost of accuracy. When vectors are too large to fit on a single machine, they need to be partitioned across the entire fleet. Partitioning can happen along user-defined attributes or along the vectors themselves (e.g. k-means partitioning). Another partitioning strategy is to use consistent hashing.
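One simple quantization scheme is scalar quantization, sketched below: each float32 component is mapped onto the 0..255 range and stored as a single byte, a 4x storage saving at the cost of some reconstruction error. (This is one of several schemes a VDMS might use; product quantization is another common choice.)

```python
import numpy as np

def quantize(vectors):
    # Map the observed [min, max] range onto 0..255 and store one byte
    # per component: lossy, but 4x smaller than float32.
    lo, hi = vectors.min(), vectors.max()
    scale = (hi - lo) / 255.0
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    # Approximate reconstruction; the error is bounded by the step size.
    return codes.astype(np.float32) * scale + lo

vectors = np.array([[0.0, 0.5, 1.0], [0.25, 0.75, 1.0]], dtype=np.float32)
codes, lo, scale = quantize(vectors)
approx = dequantize(codes, lo, scale)
print(codes.nbytes, vectors.nbytes)  # 6 bytes vs 24 bytes
```

Distance computations can run directly on the compact codes, which is why quantization speeds up queries as well as shrinking storage.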