VectorChord

Effortlessly host 100 million 768-dimensional vectors (250GB+) with VectorChord on an AWS i4i.xlarge instance ($250/month) with 4 vCPUs and 32GB of RAM.

Note

VectorChord serves as the successor to pgvecto.rs with better stability and performance. If you are interested in this new solution, you may find the migration guide helpful.

VectorChord (vchord) is a PostgreSQL extension designed for scalable, high-performance, and disk-efficient vector similarity search.

With VectorChord, you can store 400,000 vectors for just $1, enabling significant savings: 6x more vectors compared to Pinecone's optimized storage and 26x more than pgvector/pgvecto.rs for the same price¹.

Features

VectorChord introduces remarkable enhancements over pgvecto.rs and pgvector:

⚡ Enhanced Performance: Delivering optimized operations with up to 5x faster queries, 16x higher insert throughput, and 16x quicker¹ index building compared to pgvector's HNSW implementation.

💰 Affordable Vector Search: Query 100M 768-dimensional vectors using just 32GB of memory, achieving 35ms P50 latency at top-10 recall@95%, helping you keep infrastructure costs down while maintaining high search quality.

🔌 Seamless Integration: Fully compatible with pgvector data types and syntax while providing optimal defaults out of the box - no manual parameter tuning needed. Just drop in VectorChord for enhanced performance.

🔧 Accelerated Index Build: Leverage IVF to build indexes externally (e.g., on GPU) for faster KMeans clustering, combined with RaBitQ² compression to efficiently store vectors while maintaining search quality through autonomous reranking.

📏 Long Vector Support: Store and search vectors up to 60,000³ dimensions, enabling the use of the best high-dimensional models like text-embedding-3-large with ease.

🌐 Scale As You Want: With horizontal scaling, queries over 5M/100M 768-dimensional vectors can easily be scaled to 10,000+ QPS with top-10 recall@90% at a competitive cost⁴.

Requirements

Tip

If you are using the official Docker image, you can skip this step.

VectorChord depends on pgvector; ensure the pgvector extension is available:

SELECT * FROM pg_available_extensions WHERE name = 'vector';

If pgvector is not available, install it using the pgvector installation instructions.

Also make sure to add vchord.so to shared_preload_libraries in postgresql.conf.

-- Add vchord.so to shared_preload_libraries
-- Note: A restart is required for this setting to take effect.
ALTER SYSTEM SET shared_preload_libraries = 'vchord.so';
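
After restarting PostgreSQL, you can verify that the setting took effect:

-- Should include 'vchord.so' in the output
SHOW shared_preload_libraries;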

Quick Start

For new users, we recommend using the Docker image to get started quickly.

docker run \
  --name vectorchord-demo \
  -e POSTGRES_PASSWORD=mysecretpassword \
  -p 5432:5432 \
  -d tensorchord/vchord-postgres:pg17-v0.2.1

Then you can connect to the database using the psql command line tool. The default username is postgres, and the default password is mysecretpassword.

psql -h localhost -p 5432 -U postgres

Now you can play with VectorChord!
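
As a quick smoke test once connected (a condensed version of the Usage examples below; the demo table name is just an illustration):

CREATE EXTENSION IF NOT EXISTS vchord CASCADE;
CREATE TABLE demo (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO demo (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
-- Nearest neighbor by L2 distance
SELECT id, embedding <-> '[3,1,2]' AS dist FROM demo ORDER BY dist LIMIT 1;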

Documentation

Installation

You can easily get the Docker image from:

docker pull tensorchord/vchord-postgres:pg17-v0.2.1

Debian and Ubuntu packages can be found on the release page.

To install it:

wget https://github.com/tensorchord/VectorChord/releases/download/0.2.1/postgresql-17-vchord_0.2.1-1_amd64.deb
sudo apt install ./postgresql-17-vchord_*.deb

More Methods

VectorChord also supports other installation methods, including building from source (see Development below).

Usage

VectorChord depends on pgvector, including its vector representation, so it keeps maximum compatibility with pgvector for both data types and query syntax.

Since you can use them directly, your application can be migrated without pain!

Before anything else, run the following SQL to ensure the extension is enabled.

CREATE EXTENSION IF NOT EXISTS vchord CASCADE;

This installs both pgvector and VectorChord; see Requirements for more detail.
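
To confirm that both extensions are installed, you can query the pg_extension catalog:

-- Lists the installed versions of pgvector and VectorChord
SELECT extname, extversion FROM pg_extension WHERE extname IN ('vector', 'vchord');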

Storing

Similar to pgvector, you can create a table with a vector column in VectorChord and insert some rows into it.

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) SELECT ARRAY[random(), random(), random()]::real[] FROM generate_series(1, 1000);
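
You can also insert individual vectors using pgvector's text literal syntax:

-- A single vector literal; the values are just an illustration
INSERT INTO items (embedding) VALUES ('[0.1, 0.2, 0.3]');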

Indexing

Similar to ivfflat, VectorChord's index type, RaBitQ (vchordrq), divides vectors into lists and then searches a subset of those lists closest to the query vector. It inherits the advantages of ivfflat, such as fast build times and low memory usage, but delivers much better performance than hnsw and ivfflat.

The RaBitQ (vchordrq) index is supported on some pgvector types and metrics:

|                          | vector | halfvec | bit(n) | sparsevec |
| ------------------------ | ------ | ------- | ------ | --------- |
| L2 distance / <->        | ✅     | ✅      | 🆖     | ❌        |
| inner product / <#>      | ✅     | ✅      | 🆖     | ❌        |
| cosine distance / <=>    | ✅     | ✅      | 🆖     | ❌        |
| L1 distance / <+>        | ❌     | ❌      | 🆖     | ❌        |
| Hamming distance / <~>   | 🆖     | 🆖      | 🔜     | 🆖        |
| Jaccard distance / <%>   | 🆖     | 🆖      | 🔜     | 🆖        |

Where:

  • ✅ means supported by pgvector and VectorChord
  • ❌ means supported by pgvector but not by VectorChord
  • 🆖 means not planned by pgvector and VectorChord
  • 🔜 means supported by pgvector now and will be supported by VectorChord soon

To create the VectorChord RaBitQ(vchordrq) index, you can use the following SQL.

L2 distance

CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
residual_quantization = true
[build.internal]
lists = [1000]
spherical_centroids = false
$$);

Note

  • Set residual_quantization to true and spherical_centroids to false for L2 distance
  • Use halfvec_l2_ops for halfvec
  • The recommended lists value is rows / 1000 for up to 1M rows, and 4 * sqrt(rows) for over 1M rows (see the sketch after this note)
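
As a sanity check, you can compute the recommended lists value from the row count directly in SQL; a minimal sketch using the items table from above:

-- rows / 1000 for up to 1M rows, 4 * sqrt(rows) beyond that
SELECT CASE
         WHEN count(*) <= 1000000 THEN count(*) / 1000
         ELSE (4 * sqrt(count(*)))::int
       END AS recommended_lists
FROM items;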

Inner product

CREATE INDEX ON items USING vchordrq (embedding vector_ip_ops) WITH (options = $$
residual_quantization = false
[build.internal]
lists = [1000]
spherical_centroids = true
$$);

Cosine distance

CREATE INDEX ON items USING vchordrq (embedding vector_cosine_ops) WITH (options = $$
residual_quantization = false
[build.internal]
lists = [1000]
spherical_centroids = true
$$);

Note

  • Set residual_quantization to false and spherical_centroids to true for inner product/cosine distance
  • Use halfvec_cosine_ops/halfvec_ip_ops for halfvec

Query

The query statement is exactly the same as in pgvector. VectorChord supports arbitrary filter operations and WHERE/JOIN clauses, like pgvecto.rs with VBASE (see the filtered example after the list below).

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Supported distance functions are:

  • <-> - L2 distance
  • <#> - (negative) inner product
  • <=> - cosine distance
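
Because filters are supported, you can combine the index scan with ordinary predicates; a minimal sketch using the items table from above (the predicate is just an illustration):

-- Filtered nearest neighbor search
SELECT * FROM items WHERE id < 500 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;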

Performance Tuning

Index Build Time

Index building can be parallelized, and with external centroid precomputation, the total time is primarily limited by disk speed. Optimize parallelism using the following settings:

-- Set this to the number of CPU cores available for parallel operations.
SET max_parallel_maintenance_workers = 8;
SET max_parallel_workers = 8;

-- Adjust the total number of worker processes. 
-- Note: A restart is required for this setting to take effect.
ALTER SYSTEM SET max_worker_processes = 8;

Query Performance

You can fine-tune the search performance by adjusting the probes and epsilon parameters:

-- Set probes to control the number of lists scanned. 
-- Recommended range: 3%–10% of the total `lists` value.
SET vchordrq.probes = 100;

-- Set epsilon to control the reranking precision.
-- A larger value means more reranking, for a higher recall rate at higher latency.
-- If you need less precise queries, setting it to 1.0 might be appropriate.
-- Recommended range: 1.0–1.9. Default value is 1.9.
SET vchordrq.epsilon = 1.9;

-- vchordrq relies on a projection matrix to optimize performance.
-- Add your vector dimensions to the `prewarm_dim` list to reduce latency.
-- If this is not configured, the first query will have higher latency as the matrix is generated on demand.
-- Default value: '64,128,256,384,512,768,1024,1536'
-- Note: This setting requires a database restart to take effect.
ALTER SYSTEM SET vchordrq.prewarm_dim = '64,128,256,384,512,768,1024,1536';
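
These settings can also be scoped more narrowly or more broadly than the session; a sketch, assuming a database named mydb (the name is an illustration):

-- Apply to every new connection to one database
ALTER DATABASE mydb SET vchordrq.probes = 100;

-- Apply only within the current transaction
BEGIN;
SET LOCAL vchordrq.probes = 50;
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
COMMIT;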

And for PostgreSQL settings:

-- If using SSDs, set `effective_io_concurrency` to 200 for faster disk I/O.
SET effective_io_concurrency = 200;

-- Disable JIT (Just-In-Time Compilation) as it offers minimal benefit (1–2%) 
-- and adds overhead for single-query workloads.
SET jit = off;

-- Allocate at least 25% of total memory to `shared_buffers`.
-- For disk-heavy workloads, you can increase this up to 90% of total memory.
-- You may also want to disable swap with network storage to avoid I/O hangs.
-- Note: A restart is required for this setting to take effect.
ALTER SYSTEM SET shared_buffers = '8GB';

Advanced Features

Indexing prewarm

For disk-first indexing, RaBitQ (vchordrq) is loaded from disk on the first query and then cached in memory if shared_buffers is sufficient, so the first (cold) query is significantly slower.

To improve performance for the first query, you can try the following SQL that preloads the index into memory.

-- vchordrq_prewarm(index_name::regclass) prewarms the index into the shared buffer
SELECT vchordrq_prewarm('gist_train_embedding_idx'::regclass);

Indexing Progress

You can check the indexing progress by querying the pg_stat_progress_create_index view; re-run the query periodically (for example with psql's \watch) to follow the build.

SELECT phase, round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS "%" FROM pg_stat_progress_create_index;

External Index Precomputation

Unlike a pure SQL build, external index precomputation first performs the clustering outside the database and inserts the centroids into a PostgreSQL table. Although this is more complicated, an external build is much faster on larger datasets (>5M vectors).

To get started, you need to do a clustering of vectors using faiss, scikit-learn or any other clustering library.

The centroids should be stored in a table (any name) with three columns:

  • id (integer): id of each centroid, must be unique
  • parent (integer, nullable): parent id of each centroid, NULL for plain (non-hierarchical) clustering
  • vector (vector): representation of each centroid, using the pgvector vector type

An example could look like this:

-- Create a table of centroids
CREATE TABLE public.centroids (id integer NOT NULL UNIQUE, parent integer, vector vector(768));
-- Insert centroids into it
INSERT INTO public.centroids (id, parent, vector) VALUES (1, NULL, '[0.1, 0.2, 0.3, ..., 0.768]');
INSERT INTO public.centroids (id, parent, vector) VALUES (2, NULL, '[0.4, 0.5, 0.6, ..., 0.768]');
INSERT INTO public.centroids (id, parent, vector) VALUES (3, NULL, '[0.7, 0.8, 0.9, ..., 0.768]');
-- ...

-- Create index using the centroid table
CREATE INDEX ON gist_train USING vchordrq (embedding vector_l2_ops) WITH (options = $$
[build.external]
table = 'public.centroids'
$$);
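
Before building, it can help to sanity-check the centroid table; a minimal sketch, assuming the public.centroids table above:

-- The two counts should match (ids are unique), and the total
-- should equal the number of lists you intend to build with
SELECT count(*) AS centroids, count(DISTINCT id) AS unique_ids FROM public.centroids;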

To simplify the workflow, we provide end-to-end scripts for external index pre-computation, see scripts.

Range Query

To query vectors within a certain distance range, you can use the following syntax.

-- Query vectors within a certain distance range
SELECT vec FROM t WHERE vec <<->> sphere('[0.24, 0.24, 0.24]'::vector, 0.012)
ORDER BY vec <-> '[0.24, 0.24, 0.24]' LIMIT 5;

In this expression, vec <<->> sphere('[0.24, 0.24, 0.24]'::vector, 0.012) is equivalent to vec <-> '[0.24, 0.24, 0.24]' < 0.012. However, the latter triggers an exact nearest neighbor search, as that form cannot be pushed down to the index.

Supported range functions are:

  • <<->> - L2 distance
  • <<#>> - (negative) inner product
  • <<=>> - cosine distance
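
The same pattern works for the other metrics; a sketch using the cosine range operator on the table above (the threshold is just an illustration):

-- Vectors whose cosine distance to the query is below 0.1
SELECT vec FROM t WHERE vec <<=>> sphere('[0.24, 0.24, 0.24]'::vector, 0.1);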

Development

Build the Postgres Docker Image with VectorChord extension

Follow the steps in Dev Guidance.

Installing From Source

Install pgrx according to pgrx's instructions.

cargo install --locked cargo-pgrx
cargo pgrx init --pg17 $(which pg_config) # To init with system postgres, with pg_config in PATH
cargo pgrx install --release --sudo # To install the extension into the system postgres with sudo

Limitations

  • KMeans Clustering: The built-in KMeans clustering relies on a multi-threaded, in-memory build and may require substantial memory. We strongly recommend using external centroid precomputation for efficient index construction.

License

This software is licensed under a dual license model:

  1. GNU Affero General Public License v3 (AGPLv3): You may use, modify, and distribute this software under the terms of the AGPLv3.

  2. Elastic License v2 (ELv2): You may also use, modify, and distribute this software under the Elastic License v2, which has specific restrictions.

You may choose either license based on your needs. We welcome any commercial collaboration or support, so please email us at [email protected] with any questions or requests regarding the licenses.

Footnotes

  1. Based on the MyScale Benchmark with 768-dimensional vectors and 95% recall. Please check out our blog post for more details.

  2. Gao, Jianyang, and Cheng Long. "RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search." Proceedings of the ACM on Management of Data 2.3 (2024): 1-27.

  3. pgvector currently has a limit of 16,000 dimensions. If you really need higher dimensionality (16,000 < dim < 60,000), consider changing VECTOR_MAX_DIM and compiling pgvector yourself.

  4. Please check our blog post for more details; the PostgreSQL scalability is powered by CloudNative-PG.