Consider the distributed case #637
Hi, thanks for your interest. We did some initial tests using pgvecto.rs with a distributed FDW setup and it works; I think Citus is a similar case. We're also actively developing https://github.com/tensorchord/VectorChord, the next generation of pgvecto.rs. It's more scalable, more stable, and faster than pgvecto.rs, and with its disk-friendly design you can host 100M vectors on a 4C32G machine, so we believe a single machine should easily host 1B vectors. Citus can definitely help when users have larger datasets with high QPS requirements, but it also introduces a performance penalty. You can check the experiment at https://jkatz05.com/post/postgres/distributed-pgvector/. Would you like to share your target scenario and data scale with us?
I'm thinking precisely about improving QPS, and my thoughts are quite abstract and imprecise:
Let's ground this in a more (or less) realistic example. Please correct me if my ballpark estimates are wrong, and more importantly if the logic is wrong. Also please keep in mind that I just came up with the example dataset.

The system is a short-format video recommender. The dataset is ~10M vectors with dim ~1000. That's roughly 50 GB (a back-of-envelope estimate is sketched below), so on a 32 GB machine with <20 GB or so to spare for caches, it will likely work one query at a time, with occasional hangs when an unpopular part of the table is queried and can't be served from the cache. As we try to scale the QPS, the occasional cache misses and subsequent evictions start hitting harder and harder, and most queries quickly end up running in seconds rather than milliseconds. That's a positive feedback loop; we hit a wall.

Of course, we can scale up, but my main point is that sooner or later scaling up becomes cumbersome, more cumbersome than just dropping the pesky unpopular cluster of videos with welding tutorials or whatever. The goal of scaling horizontally, the way I'm imagining it, is to parallelize the lookup in a way that decouples the funny-dog-video recommendations from the welding tutorials. I'm aware that by default in vector stores all nodes return N candidates which are then reranked, but that's just a safe default. In principle there's no need to run the funny-dog recommendations on the welding-video shard (apart from the curse of dimensionality, but we're already knee-deep in approximate indexes).

So, if the sharding has a correspondence with ivfflat lists (or some HNSW alternative I haven't thought through very well), we get performance savings since we're not straining all nodes, and we definitely save cache by not evicting our dogs for the occasional welding tutorial. Do you find it reasonable at all? I feel like I'm addressing the problem of a massive number of neighbours in higher dimensions by waving my hands and saying "that's what ivfflat is for, so our recall doesn't suffer", but I might be completely off.
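For reference, the back-of-envelope behind the 50 GB figure, assuming plain float32 components and ignoring index and tuple overhead (so the real footprint is somewhat larger):

-- 10M vectors x 1000 dims x 4 bytes per float32 component
select (10 * 1000 * 1000)::bigint   -- number of vectors
     * 1000                         -- dimensions per vector
     * 4                            -- bytes per component
     / (1024.0 * 1024 * 1024) as raw_gib;   -- ~37 GiB raw; the ~50 GB ballpark leaves room for index and tuple overhead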
For Postgres, you can easily add more replicas to meet a QPS requirement. As for the distributed idea, I think it's valid and you get the point of ivfflat. The problem is that it's hard to make Postgres work that way: we would need to hack into its planner to skip specific shards, and that would definitely violate some of Postgres's defaults. So I think it's more viable to build this as a specialized vector DB. Inside Postgres, the scale-up approach is easier to achieve and covers most users' requirements.
That is incredibly sad! Do you think it's worth trying workarounds like a helper table with centroids and partition IDs? Postgres can definitely prune partitions when using a simple "where partition_id in (1, 2, 42)".
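A minimal sketch of what I mean, with made-up names (videos, cluster_management) and assuming a pgvecto.rs/pgvector-style vector(1000) column; the point is just that the IN list on the partition key lets the planner skip every other partition:

-- One list partition per ivfflat-style cluster.
create table videos (
    id         bigint,
    cluster_id int not null,
    embedding  vector(1000)
) partition by list (cluster_id);

create table videos_c1 partition of videos for values in (1);
create table videos_c2 partition of videos for values in (2);
-- ...and so on, one partition per cluster.

-- Small helper table: one centroid per cluster, used only for routing.
create table cluster_management (
    cluster_id int primary key,
    centroids  vector(1000)
);

-- Only the partitions listed in the IN clause (and their local ANN indexes)
-- are touched; everything else is pruned.
select id
from videos
where cluster_id in (1, 2, 42)   -- cluster ids chosen via the centroid table
order by embedding <#> $input
limit 10;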
I think there would literally be no need to use any other vector store implementation if pgvecto.rs implemented some level of smarts when it comes to sharding. My ideas include:
select cluster_id from cluster_management order by centroids <#> $input
So essentially I'm asking for a distributed index for sharding with Citus or the like. I'm not sure how big of a help I can be, but I can surely try implementing it!
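A hedged sketch of the Citus side, reusing the same made-up schema from above (a plain, non-partitioned videos table this time); create_distributed_table and create_reference_table are standard Citus calls, everything else is an assumption on my part:

-- Shard the video table on cluster_id, so each shard holds whole clusters.
select create_distributed_table('videos', 'cluster_id');

-- Replicate the tiny centroid table to every node so routing stays cheap.
select create_reference_table('cluster_management');

-- Step 1: pick the closest clusters for the query vector.
select cluster_id
from cluster_management
order by centroids <#> $input
limit 3;

-- Step 2: with the distribution column in the filter, Citus can prune to the
-- shards holding clusters 1, 2 and 42, so only those nodes run the ANN scan.
select id
from videos
where cluster_id in (1, 2, 42)   -- ids returned by step 1
order by embedding <#> $input
limit 10;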