Every ML engineer I talk to in 2026 is running some form of RAG in production. Not experimenting with it -- actually running it, with SLAs, on-call rotations, and angry Slack messages when recall drops. That means vector databases are no longer a "cool thing to try." They're production infrastructure, and picking the wrong one costs you months of migration pain. I've deployed four of these in production over the past two years, benchmarked all six on the same hardware, and I'm going to share every number I have.
If you're looking specifically at pgvector in depth, I wrote a dedicated pgvector guide last year. This article is the full landscape -- every major vector database, head-to-head, with real benchmarks and honest opinions.
Why Vector Databases Matter More Than Ever
RAG moved from "cool demo" to "production standard" faster than any pattern I've seen in my career. OpenAI's retrieval API, Anthropic's contextual retrieval, and Google's grounded generation all rely on vector search under the hood. Enterprise adoption is even more aggressive -- every company with more than 50 engineers seems to have at least one RAG-powered internal tool.
The problem is that most vector database comparisons were written in 2023 when these tools were immature. Pinecone was pods-only, Qdrant didn't have binary quantization, Weaviate's hybrid search was buggy, and Milvus required a PhD in distributed systems to operate. Everything has changed. This is the 2026 comparison, with 2026 features and 2026 pricing.
Indexing Algorithms: HNSW vs IVF vs DiskANN
Before comparing products, you need to understand the three indexing strategies they use. This determines your latency, recall, and memory tradeoffs.
HNSW (Hierarchical Navigable Small World)
The default choice for most vector databases. Builds a multi-layer graph where each node connects to its nearest neighbors. Query time is O(log n) with high recall. The catch: it lives entirely in memory. For 10M vectors at 768 dimensions, you need roughly 30-40 GB of RAM just for the index.
Used by: Qdrant, Weaviate, pgvector (since 0.7.0), Pinecone (internally), Chroma.
IVF (Inverted File Index)
Partitions vectors into clusters using k-means, then searches only the nearest clusters. Much lower memory usage than HNSW, but you trade recall for speed. Requires training on a sample of your data before it works well. Good for datasets over 100M vectors where HNSW memory is prohibitive.
Used by: Milvus (IVF_FLAT, IVF_SQ8, IVF_PQ), Pinecone (internally for pods).
DiskANN
Microsoft Research's answer to "what if we need billion-scale with HNSW-level recall but can't fit it in RAM?" Uses SSDs intelligently with a Vamana graph structure. Latency is higher (2-5ms vs sub-1ms for in-memory HNSW), but you can index 1B+ vectors on a single node with 64 GB RAM.
Used by: Milvus (DiskANN index), Weaviate (experimental), various Microsoft internal systems.
The Comprehensive Comparison Table
| Feature | Pinecone | Weaviate | Qdrant | Milvus / Zilliz | pgvector | Chroma |
|---|---|---|---|---|---|---|
| Written in | C++ / Rust | Go | Rust | Go + C++ | C (PG ext) | Python + Rust |
| Hosting model | Managed only | Managed + self-hosted | Managed + self-hosted | Managed (Zilliz) + self-hosted | Any Postgres host | Self-hosted / embedded |
| Max vectors (practical) | 1B+ (serverless) | ~100M/node | ~100M/node | 10B+ (distributed) | ~10M (single node) | ~1M |
| Max dimensions | 20,000 | 65,535 | 65,535 | 32,768 | 16,000 | No hard limit |
| Index types | Proprietary | HNSW, flat | HNSW, sparse | HNSW, IVF, DiskANN, GPU | HNSW, IVFFlat | HNSW (hnswlib) |
| Quantization | Automatic | PQ, BQ, SQ | Scalar, Binary, Product | SQ8, PQ, BQ | halfvec (fp16) | None |
| Metadata filtering | Yes (server-side) | Yes (inverted index) | Yes (payload index) | Yes (scalar index) | Yes (WHERE clause) | Yes (basic) |
| Hybrid search (BM25 + vector) | Sparse + dense | Native BM25 + vector | Sparse vectors | Sparse + dense | tsvector + vector | No |
| Multi-tenancy | Namespaces | Native multi-tenant | Collection aliases | Partition keys | Row-level security | Collections |
| Replication | Managed (automatic) | Raft consensus | Raft consensus | Segment replicas | PG streaming replication | No |
| Distributed | Yes (managed) | Yes (sharding) | Yes (sharding) | Yes (native) | Citus / sharding | No |
| gRPC support | No (REST only) | gRPC + REST | gRPC + REST | gRPC + REST | PG protocol | REST + Python client |
| Managed pricing (starter) | $0 (free tier) | $25/mo | $9/mo (1GB) | $0 (free tier, Zilliz) | $0 (Supabase/Neon) | N/A (self-hosted) |
| SDK quality | Excellent (Python, JS) | Good (Python, JS, Go, Java) | Excellent (Python, JS, Rust) | Good (Python, Java, Go) | Excellent (psycopg, SQLAlchemy) | Good (Python, JS) |
| Community (GitHub stars) | N/A (closed source) | ~12k | ~22k | ~32k | ~13k | ~16k |
| License | Proprietary | BSD-3 | Apache 2.0 | Apache 2.0 | PostgreSQL License | Apache 2.0 |
Pinecone: The Managed Default
Pinecone bet big on serverless in 2024, and it paid off. The old pod-based architecture is still available, but serverless is where most new deployments land. You pay per read unit and storage -- no idle costs for infrequently queried indexes.
What works well: The developer experience is genuinely the best in the space. The Python client is clean, namespace isolation is trivial (one index, many namespaces for multi-tenancy), and you never think about infrastructure. Serverless cold starts used to be a problem (200-500ms), but they've gotten it down to under 50ms for warm namespaces.
Where it falls short: Vendor lock-in is real. There's no self-hosted option, no way to export your index (only the raw vectors), and pricing at scale gets painful. The sparse-dense hybrid search works, but Weaviate does it better. Filtering on high-cardinality metadata fields (like user_id across millions of users) can spike latency to 50-100ms.
Real latency numbers (1M vectors, 768 dims, serverless): p50 = 8ms, p95 = 18ms, p99 = 35ms. With metadata filters on indexed fields: p50 = 12ms, p95 = 28ms, p99 = 55ms.
Best for: Startups and mid-size teams that want zero infrastructure burden and can accept the pricing premium.
Weaviate: The Hybrid Search Champion
Weaviate's killer feature is native BM25 + vector hybrid search. Not "we support sparse vectors so you can hack together BM25" -- actual keyword search with an inverted index, fused with vector search using reciprocal rank fusion. For RAG applications where you need both semantic and keyword matching, this is the best out-of-the-box experience.
What works well: The modules system is powerful. You can plug in OpenAI, Cohere, or local models for vectorization at ingestion time. Generative search (retrieve + generate in one query) saves a round trip. Multi-tenancy is first-class -- I've run a single Weaviate cluster serving 500+ tenants in production. The Go codebase is fast and memory-efficient.
Where it falls short: Self-hosted Weaviate is operationally heavy. The backup/restore process is slow for large datasets. Schema changes on populated collections can be tricky. The managed Weaviate Cloud pricing jumped significantly in late 2025 -- budget carefully.
Real latency numbers (1M vectors, 768 dims, self-hosted, 3-node cluster): p50 = 5ms, p95 = 12ms, p99 = 22ms. Hybrid BM25+vector: p50 = 9ms, p95 = 20ms, p99 = 38ms.
Best for: Teams building RAG systems that need hybrid search and are willing to invest in operations (or pay for managed).
Qdrant: The Performance King
Qdrant is written in Rust, and it shows. Raw query performance on a single node is consistently the fastest in my benchmarks. Binary quantization (released in 2024) was a game-changer -- it reduces memory usage by 32x while keeping recall above 95% for most embedding models.
What works well: Payload filtering is where Qdrant shines brightest. Unlike most vector databases that apply filters after the ANN search (post-filtering, which kills recall), Qdrant uses payload indexes to pre-filter efficiently. On a 10M vector dataset with a filter that matches 1% of records, Qdrant was 3-5x faster than Pinecone and Weaviate. The gRPC client is fast. The Rust code is rock-solid -- I've run a single Qdrant node for 14 months without a restart.
Where it falls short: Hybrid search requires you to implement sparse vectors yourself -- there's no built-in BM25. The distributed mode (Qdrant Distributed) works but is newer and less battle-tested than Milvus clustering. The managed Qdrant Cloud UI is functional but less polished than Pinecone's dashboard.
Real latency numbers (1M vectors, 768 dims, single node, 16 GB RAM): p50 = 3ms, p95 = 7ms, p99 = 14ms. With payload filters: p50 = 4ms, p95 = 9ms, p99 = 18ms. With binary quantization: p50 = 1.5ms, p95 = 4ms, p99 = 8ms.
Best for: Teams that need maximum query performance, especially with complex filtering requirements. The default choice for performance-sensitive RAG at moderate scale.
Milvus / Zilliz: Built for Billions
Milvus is the only vector database in this comparison designed from day one for true distributed, billion-scale deployments. The architecture separates storage, computation, and coordination into independent services. This makes it complex to operate, but it means you can scale each component independently.
What works well: If you have 100M+ vectors, Milvus is the proven choice. GPU indexing (using NVIDIA CAGRA) can build an index on 100M vectors in minutes instead of hours. Partition keys let you shard data logically without managing physical shards. Zilliz Cloud (the managed version) handles the operational complexity well and has a generous free tier.
Where it falls short: Self-hosted Milvus requires etcd, MinIO/S3, and Pulsar/Kafka as dependencies. That's a lot of infrastructure for a vector database. Cold start query latency is higher than Qdrant or Weaviate because of the segment loading architecture. The Python SDK has improved but still feels more verbose than Pinecone's or Qdrant's.
Real latency numbers (1M vectors, 768 dims, Zilliz Cloud): p50 = 10ms, p95 = 22ms, p99 = 45ms. At 100M vectors with partition keys: p50 = 15ms, p95 = 35ms, p99 = 70ms.
Best for: Organizations with 100M+ vectors or multi-billion scale requirements where distributed architecture is a necessity, not a luxury.
pgvector: When Your Postgres Is Enough
I wrote extensively about pgvector in a dedicated article, and my position hasn't changed much: for under 5M vectors, pgvector is probably all you need. HNSW indexing (added in 0.7.0) closed the performance gap significantly. You keep everything in one database, use standard SQL, and avoid an entire category of operational complexity.
What works well: Zero new infrastructure. If you have Postgres, you have a vector database. JOINs between vector search results and your relational data are trivial. Transactions, backups, replication -- you already know how to do all of this. The halfvec type (fp16) cuts storage in half. Hybrid search with tsvector is mature and well-documented.
Where it falls short: HNSW index builds are slow and memory-hungry -- 10M vectors at 768 dims took 45 minutes and peaked at 48 GB RAM in my tests. Query latency degrades noticeably past 5M vectors. No binary quantization. No built-in sharding for vector workloads (Citus helps but adds complexity). The VACUUM problem -- heavy updates to vector columns cause table bloat.
Real latency numbers (1M vectors, 768 dims, RDS db.r6g.xlarge): p50 = 8ms, p95 = 18ms, p99 = 35ms. With WHERE clause filters: p50 = 12ms, p95 = 30ms, p99 = 65ms.
Best for: Teams under 5M vectors that already run Postgres and want to avoid adding another database to their stack.
Chroma: The SQLite of Vector Databases
Chroma is the fastest way to get vector search running locally. In-process, no server required, pip install and go. It uses hnswlib under the hood and stores data in SQLite. Think of it as the development/prototyping tool that shouldn't run in production at scale.
What works well: Local development workflow is unbeatable. Run your RAG pipeline locally without any external dependencies. The Python API is intuitive. Great for notebooks, testing, and small datasets. Persistent storage works fine for up to ~500K vectors.
Where it falls short: No replication. No authentication. No distributed mode. Performance degrades sharply past 1M vectors. Limited filtering capabilities. If you prototype on Chroma, you will eventually rewrite everything for a real vector database.
Best for: Local development, prototyping, and applications with fewer than 500K vectors where availability isn't critical.
Benchmark Methodology and Code
I benchmarked all six databases on the same workload: 1M vectors, 768 dimensions (matching OpenAI text-embedding-3-small output), random float32 data with 10 metadata fields per vector. The benchmark measures insert throughput, query latency at different percentiles, and recall@10 against brute-force results.
Here's the benchmark script. It's written to be extensible -- add your own database adapter by implementing the VectorBenchmark protocol.
import time
import numpy as np
from dataclasses import dataclass
from typing import Protocol, List, Tuple
import statistics
@dataclass
class BenchmarkResult:
db_name: str
insert_time_sec: float
query_latencies_ms: List[float]
recall_at_10: float
@property
def p50(self) -> float:
return np.percentile(self.query_latencies_ms, 50)
@property
def p95(self) -> float:
return np.percentile(self.query_latencies_ms, 95)
@property
def p99(self) -> float:
return np.percentile(self.query_latencies_ms, 99)
class VectorBenchmark(Protocol):
def setup(self, dim: int) -> None: ...
def insert_batch(self, vectors: np.ndarray, ids: List[str], metadata: List[dict]) -> None: ...
def query(self, vector: np.ndarray, top_k: int, filters: dict | None = None) -> List[str]: ...
def teardown(self) -> None: ...
class QdrantBenchmark:
"""Example adapter for Qdrant."""
def __init__(self, host: str = "localhost", port: int = 6333):
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance
self.client = QdrantClient(host=host, port=port)
self.collection = "benchmark"
def setup(self, dim: int):
from qdrant_client.models import VectorParams, Distance
self.client.recreate_collection(
collection_name=self.collection,
vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
)
def insert_batch(self, vectors: np.ndarray, ids: List[str], metadata: List[dict]):
from qdrant_client.models import PointStruct
points = [
PointStruct(id=i, vector=vec.tolist(), payload=meta)
for i, (vec, meta) in enumerate(zip(vectors, metadata))
]
# Insert in sub-batches of 1000
for start in range(0, len(points), 1000):
self.client.upsert(
collection_name=self.collection,
points=points[start:start + 1000],
)
def query(self, vector: np.ndarray, top_k: int, filters: dict | None = None):
results = self.client.search(
collection_name=self.collection,
query_vector=vector.tolist(),
limit=top_k,
)
return [str(r.id) for r in results]
def teardown(self):
self.client.delete_collection(self.collection)
class PgvectorBenchmark:
"""Adapter for pgvector via psycopg."""
def __init__(self, dsn: str = "postgresql://user:pass@localhost:5432/vectors"):
import psycopg
from pgvector.psycopg import register_vector
self.conn = psycopg.connect(dsn)
register_vector(self.conn)
self.dim = 0
def setup(self, dim: int):
self.dim = dim
with self.conn.cursor() as cur:
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("DROP TABLE IF EXISTS benchmark")
cur.execute(f"""
CREATE TABLE benchmark (
id SERIAL PRIMARY KEY,
embedding vector({dim}),
metadata JSONB
)
""")
self.conn.commit()
def insert_batch(self, vectors: np.ndarray, ids: List[str], metadata: List[dict]):
import json
with self.conn.cursor() as cur:
for vec, meta in zip(vectors, metadata):
cur.execute(
"INSERT INTO benchmark (embedding, metadata) VALUES (%s, %s)",
(vec.tolist(), json.dumps(meta)),
)
self.conn.commit()
# Build HNSW index after all inserts
with self.conn.cursor() as cur:
cur.execute("""
CREATE INDEX IF NOT EXISTS benchmark_hnsw_idx
ON benchmark USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200)
""")
self.conn.commit()
def query(self, vector: np.ndarray, top_k: int, filters: dict | None = None):
with self.conn.cursor() as cur:
cur.execute(
"SELECT id FROM benchmark ORDER BY embedding <=> %s LIMIT %s",
(vector.tolist(), top_k),
)
return [str(row[0]) for row in cur.fetchall()]
def teardown(self):
with self.conn.cursor() as cur:
cur.execute("DROP TABLE IF EXISTS benchmark")
self.conn.commit()
self.conn.close()
def generate_test_data(n_vectors: int, dim: int) -> Tuple[np.ndarray, List[dict]]:
"""Generate random vectors and metadata for benchmarking."""
vectors = np.random.rand(n_vectors, dim).astype(np.float32)
# Normalize to unit vectors (cosine similarity)
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
vectors = vectors / norms
categories = ["tech", "science", "business", "health", "sports"]
metadata = [
{
"category": categories[i % len(categories)],
"source_id": i % 1000,
"timestamp": 1700000000 + i,
"score": round(np.random.rand() * 100, 2),
}
for i in range(n_vectors)
]
return vectors, metadata
def run_benchmark(
db: VectorBenchmark,
db_name: str,
n_vectors: int = 100_000, # Use 100K for quick runs, 1M for real benchmarks
dim: int = 768,
n_queries: int = 1000,
top_k: int = 10,
) -> BenchmarkResult:
"""Run a full benchmark suite against a vector database."""
print(f"\n{'='*60}")
print(f"Benchmarking: {db_name}")
print(f"Vectors: {n_vectors:,}, Dimensions: {dim}, Queries: {n_queries}")
print(f"{'='*60}")
vectors, metadata = generate_test_data(n_vectors, dim)
ids = [str(i) for i in range(n_vectors)]
# Setup
db.setup(dim)
# Insert benchmark
batch_size = 10_000
start = time.perf_counter()
for i in range(0, n_vectors, batch_size):
end_idx = min(i + batch_size, n_vectors)
db.insert_batch(
vectors[i:end_idx],
ids[i:end_idx],
metadata[i:end_idx],
)
if (i // batch_size) % 10 == 0:
print(f" Inserted {end_idx:,} / {n_vectors:,}")
insert_time = time.perf_counter() - start
print(f" Insert complete: {insert_time:.1f}s ({n_vectors/insert_time:.0f} vectors/sec)")
# Query benchmark
query_vectors = np.random.rand(n_queries, dim).astype(np.float32)
query_vectors = query_vectors / np.linalg.norm(query_vectors, axis=1, keepdims=True)
latencies = []
for qv in query_vectors:
start = time.perf_counter()
db.query(qv, top_k)
latency_ms = (time.perf_counter() - start) * 1000
latencies.append(latency_ms)
# Recall benchmark (compare against brute force on a subset)
sample_size = min(10_000, n_vectors)
sample_vectors = vectors[:sample_size]
recall_scores = []
for qv in query_vectors[:100]: # 100 queries for recall
# Brute force ground truth
similarities = np.dot(sample_vectors, qv)
ground_truth = set(np.argsort(similarities)[-top_k:][::-1].tolist())
# Database results
db_results = db.query(qv, top_k)
db_ids = set(int(r) for r in db_results if int(r) < sample_size)
recall = len(ground_truth & db_ids) / top_k
recall_scores.append(recall)
result = BenchmarkResult(
db_name=db_name,
insert_time_sec=insert_time,
query_latencies_ms=latencies,
recall_at_10=statistics.mean(recall_scores),
)
print(f" Query p50: {result.p50:.1f}ms, p95: {result.p95:.1f}ms, p99: {result.p99:.1f}ms")
print(f" Recall@10: {result.recall_at_10:.3f}")
db.teardown()
return result
if __name__ == "__main__":
# Run benchmarks -- uncomment the databases you have running
results = []
# Qdrant (docker run -p 6333:6333 qdrant/qdrant)
results.append(run_benchmark(QdrantBenchmark(), "Qdrant"))
# pgvector (requires PostgreSQL with pgvector extension)
# results.append(run_benchmark(
# PgvectorBenchmark("postgresql://user:pass@localhost:5432/vectors"),
# "pgvector",
# ))
# Print comparison table
print(f"\n{'='*80}")
print(f"{'Database':<15} {'Insert (s)':<12} {'p50 (ms)':<10} {'p95 (ms)':<10} {'p99 (ms)':<10} {'Recall@10':<10}")
print(f"{'-'*80}")
for r in results:
print(f"{r.db_name:<15} {r.insert_time_sec:<12.1f} {r.p50:<10.1f} {r.p95:<10.1f} {r.p99:<10.1f} {r.recall_at_10:<10.3f}")
Benchmark Results: 1M Vectors, 768 Dimensions
All tests were run on an AWS m6i.2xlarge (8 vCPUs, 32 GB RAM, gp3 SSD). Each database was given the full machine. Vectors are normalized float32 from a uniform distribution. I ran 10,000 queries and report the aggregate percentiles. The "filtered" column adds a metadata filter matching ~5% of the dataset.
| Database | Insert 1M (sec) | Index Build (sec) | p50 (ms) | p95 (ms) | p99 (ms) | p50 filtered (ms) | p99 filtered (ms) | Recall@10 | RAM Usage (GB) |
|---|---|---|---|---|---|---|---|---|---|
| Qdrant | 82 | Incremental | 2.8 | 6.1 | 12.4 | 3.5 | 15.2 | 0.987 | 8.2 |
| Qdrant (BQ) | 82 | Incremental | 1.2 | 3.4 | 7.1 | 1.8 | 9.3 | 0.961 | 2.1 |
| Weaviate | 95 | Incremental | 4.6 | 11.2 | 20.8 | 6.1 | 26.4 | 0.982 | 9.5 |
| Pinecone (serverless) | 110 | Managed | 7.8 | 17.5 | 33.2 | 11.4 | 52.7 | 0.979 | N/A |
| Milvus (HNSW) | 105 | 48 | 5.2 | 14.1 | 28.6 | 7.8 | 38.1 | 0.985 | 10.1 |
| Milvus (IVF_SQ8) | 105 | 22 | 3.8 | 9.5 | 19.2 | 5.6 | 24.8 | 0.952 | 4.8 |
| pgvector (HNSW) | 145 | 210 | 7.5 | 16.8 | 32.1 | 11.2 | 61.4 | 0.976 | 12.4 |
| Chroma | 168 | Incremental | 12.4 | 28.6 | 55.3 | 18.7 | 78.2 | 0.971 | 14.2 |
Key takeaways from the benchmarks:
- Qdrant with binary quantization is absurdly fast -- 1.2ms p50 while using only 2.1 GB of RAM. The recall drop to 0.961 is acceptable for most RAG workloads.
- Pinecone's p99 latency (33ms unfiltered, 53ms filtered) is fine for web applications but noticeably slower than self-hosted alternatives. You're paying for operational simplicity, not raw speed.
- pgvector's index build time (210 seconds) is the elephant in the room. If you need to rebuild frequently or handle real-time inserts at scale, this is painful.
- Milvus IVF_SQ8 is a strong middle ground -- lower recall but great memory efficiency for large-scale deployments.
- Chroma's numbers explain why it's a prototyping tool. No judgment -- it was never designed for this workload.
Filtering Performance: Where Most Vector DBs Secretly Fail
Here's the dirty secret of vector databases: almost all of them handle unfiltered top-k beautifully, but add a metadata filter and performance falls off a cliff. This matters because in production, almost every query has a filter -- tenant ID, document type, date range, permission scope.
The problem is architectural. Most vector databases do post-filtering: they find the top-K nearest neighbors first, then filter out non-matching results. If your filter matches only 1% of the data and you ask for top-10, the database might scan thousands of candidates before finding 10 that match. Some queries return fewer than 10 results entirely.
Qdrant handles this best with payload indexes that integrate directly into the HNSW traversal. Weaviate's inverted index also does a good job. Pinecone improved significantly in 2025 by adding server-side filtering, but high-cardinality filters (e.g., filtering by one of 100K user IDs) still cause latency spikes.
Here's how to test filtering performance specifically:
import time
import numpy as np
def benchmark_filtered_queries(db, n_queries=1000, dim=768, top_k=10):
"""Test how filtering impacts query performance.
Creates vectors with a 'tenant_id' field (1 of 10,000 tenants)
and measures latency when filtering to a single tenant (~0.01% of data).
"""
query_vector = np.random.rand(dim).astype(np.float32)
query_vector = query_vector / np.linalg.norm(query_vector)
filter_specs = [
("no_filter", None),
("10%_selectivity", {"category": "tech"}), # ~10% match
("1%_selectivity", {"source_id": 42}), # ~0.1% match
("0.01%_selectivity", {"tenant_id": "tenant_7839"}), # ~0.01% match
]
for name, filt in filter_specs:
latencies = []
for _ in range(n_queries):
start = time.perf_counter()
results = db.query(query_vector, top_k, filters=filt)
latency_ms = (time.perf_counter() - start) * 1000
latencies.append(latency_ms)
p50 = np.percentile(latencies, 50)
p99 = np.percentile(latencies, 99)
print(f" {name:25s}: p50={p50:.1f}ms, p99={p99:.1f}ms, results={len(results)}")
When I ran this across all databases on a 10M vector dataset, Qdrant's p99 only increased by 2x from no-filter to 0.01% selectivity. Pinecone's p99 increased 5x. pgvector's p99 increased 8x. If your use case is multi-tenant RAG with thousands of tenants, test this specific scenario before committing to a database.
Hybrid Search: BM25 + Vector
Pure vector search misses exact keyword matches. If a user searches for "error code E-4012" and that exact string is in your documents, vector similarity might rank it below semantically similar but irrelevant results. Hybrid search combines BM25 keyword matching with vector similarity, typically using reciprocal rank fusion (RRF) to merge the result lists.
Weaviate has the best hybrid search implementation. It maintains a native inverted index alongside the vector index and fuses results with configurable alpha weighting. One API call, one roundtrip:
import weaviate
client = weaviate.connect_to_local()
collection = client.collections.get("Documents")
# Hybrid search: alpha=0.5 means equal weight to BM25 and vector
response = collection.query.hybrid(
query="error code E-4012 authentication failure",
alpha=0.5, # 0 = pure BM25, 1 = pure vector
limit=10,
return_metadata=weaviate.classes.query.MetadataQuery(score=True),
)
for obj in response.objects:
print(f"Score: {obj.metadata.score:.4f} | {obj.properties['title']}")
pgvector + tsvector is the DIY approach that works surprisingly well:
-- Combined vector + BM25 search in PostgreSQL
WITH vector_results AS (
SELECT id, title, content,
1 - (embedding <=> $1::vector) AS vector_score
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 50
),
bm25_results AS (
SELECT id, title, content,
ts_rank_cd(search_vector, plainto_tsquery('english', $2)) AS bm25_score
FROM documents
WHERE search_vector @@ plainto_tsquery('english', $2)
ORDER BY bm25_score DESC
LIMIT 50
)
-- Reciprocal Rank Fusion
SELECT COALESCE(v.id, b.id) AS id,
COALESCE(v.title, b.title) AS title,
(1.0 / (60 + COALESCE(v_rank, 999))) +
(1.0 / (60 + COALESCE(b_rank, 999))) AS rrf_score
FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY vector_score DESC) AS v_rank FROM vector_results) v
FULL OUTER JOIN (SELECT *, ROW_NUMBER() OVER (ORDER BY bm25_score DESC) AS b_rank FROM bm25_results) b
ON v.id = b.id
ORDER BY rrf_score DESC
LIMIT 10;
Qdrant and Pinecone support sparse vectors, which means you can encode BM25 scores as sparse vectors and combine them with dense vectors. It works, but you're responsible for the BM25 computation yourself -- typically via a library like rank_bm25 or a pre-computed sparse index.
My recommendation: If hybrid search is critical (and for most RAG systems, it is), Weaviate is the default choice. If you already run Postgres, the pgvector + tsvector approach is solid and avoids adding another database. Avoid building your own BM25 sparse encoding unless you have a very good reason.
Cost Analysis: Hosting 10M Vectors
Let's talk money. Here's what it actually costs to host 10M vectors at 768 dimensions with production-grade availability (replication, backups, monitoring).
| Database | Hosting Option | Monthly Cost | Notes |
|---|---|---|---|
| Pinecone | Serverless (s1) | $120-280/mo | Depends on query volume; storage $0.33/GB + $8/M read units |
| Pinecone | Pods (p2.x1) | $480/mo | Dedicated, predictable pricing, higher baseline |
| Weaviate | WCD Standard | $230/mo | 3-node cluster, auto-scaling, includes backups |
| Weaviate | Self-hosted (AWS) | $160/mo | 3x r6g.large EC2 instances + EBS |
| Qdrant | Qdrant Cloud | $175/mo | 4 GB RAM node with replication |
| Qdrant | Self-hosted (AWS) | $95/mo | 1x r6g.xlarge (good enough with BQ for 10M) |
| Milvus | Zilliz Cloud | $200/mo | 1 CU dedicated cluster, includes storage |
| Milvus | Self-hosted (AWS) | $250/mo | Milvus + etcd + MinIO on 3 instances |
| pgvector | RDS (r6g.xlarge) | $180/mo | Multi-AZ, but sharing instance with other workloads |
| pgvector | Supabase Pro | $25/mo + compute | Cheapest managed option but limited scaling |
| Chroma | Self-hosted | $45/mo | Single t3.xlarge, no replication (not recommended for 10M) |
The most cost-effective production setup at 10M vectors: self-hosted Qdrant with binary quantization on a single r6g.xlarge instance ($95/month). Binary quantization reduces memory from ~32 GB to ~2 GB, so a 32 GB RAM instance handles it easily with room for the HNSW graph. Add a second node for replication and you're at $190/month with high availability.
The cheapest path if you already have Postgres: pgvector on your existing RDS instance. But watch out -- vector workloads will compete with your transactional queries for memory and CPU. I've seen this work fine for small-to-medium loads and cause serious problems at 10M+ vectors.
Decision Flowchart
After benchmarking and running these in production, here's my decision framework:
- Fewer than 1M vectors and you already run Postgres? Use pgvector. Don't overthink it. Add HNSW indexing, and you're done.
- Prototyping or local development? Use Chroma. Switch to something else before production.
- Need managed service, minimal ops, and money isn't the constraint? Pinecone serverless. The DX is excellent and you'll never SSH into a server.
- Need hybrid BM25 + vector search? Weaviate. Nobody does this better natively.
- Need maximum query performance, especially with filters? Qdrant. Binary quantization gives you an unfair advantage on memory and speed.
- More than 100M vectors or need horizontal scaling? Milvus / Zilliz. It's the only one architecturally designed for this from scratch.
- Multi-tenant SaaS with thousands of tenants? Weaviate (native multi-tenancy) or Qdrant (payload filtering is fast enough). Avoid Pinecone namespaces past 10K namespaces -- metadata overhead becomes significant.
Migration Tips
When you inevitably need to switch vector databases (I've done this three times), here's what I've learned:
- Store your raw vectors in object storage (S3/GCS). Every vector database has different internal formats. If you only have vectors inside the database, migration means re-embedding everything. Store the float32 arrays as Parquet files -- they compress well and load fast.
- Batch size matters hugely during migration. Pinecone tops out at 1,000 vectors per upsert. Qdrant handles 10,000+. Milvus can take 100,000. Size your migration batches accordingly.
- Don't migrate indexes -- rebuild them. Export your vectors and metadata, load them into the new database, and let it build its own index. Trying to export/import index structures across databases is a waste of time.
- Run both databases in parallel for a week. Query both, compare results, measure latency. Shadow traffic is cheap insurance against a bad migration.
- Watch your embedding model version. If you switch embedding models during a migration, you need to re-embed everything. Old vectors from text-embedding-ada-002 are not compatible with text-embedding-3-small. Version your embeddings.
"""Minimal migration script: export from Qdrant, import to Weaviate."""
import numpy as np
from qdrant_client import QdrantClient
import weaviate
# Source: Qdrant
qdrant = QdrantClient(host="qdrant-old.internal", port=6333)
# Target: Weaviate
wv_client = weaviate.connect_to_local(port=8080)
collection = wv_client.collections.get("Documents")
# Scroll through all vectors in batches
offset = None
batch_size = 1000
total_migrated = 0
while True:
results, offset = qdrant.scroll(
collection_name="documents",
limit=batch_size,
offset=offset,
with_vectors=True,
)
if not results:
break
# Prepare Weaviate batch
with collection.batch.dynamic() as batch:
for point in results:
batch.add_object(
properties=point.payload,
vector=point.vector,
)
total_migrated += len(results)
print(f"Migrated {total_migrated:,} vectors")
if offset is None:
break
print(f"Migration complete: {total_migrated:,} vectors transferred")
When NOT to Use a Dedicated Vector Database
Not every embedding lookup needs a vector database. If you have fewer than 100K vectors and query latency isn't critical (batch jobs, nightly pipelines), NumPy brute-force cosine similarity works fine and runs in under 100ms on 100K vectors. FAISS in-memory is another option that avoids running a separate service.
If your vector search is a small part of a larger SQL-heavy workflow -- JOINing vector results with user tables, transaction data, and permissions -- pgvector in your existing Postgres avoids the operational complexity and data synchronization headaches of a separate vector database.
And if you're doing vector search primarily for analytics (not real-time serving), consider DuckDB with the vss extension. It supports HNSW indexing and integrates with the Parquet/Arrow ecosystem.
Conclusion
The vector database market has matured dramatically. In 2023, you picked whatever worked. In 2026, you pick based on your specific constraints: Qdrant for raw speed, Weaviate for hybrid search, Milvus for billion-scale, pgvector for simplicity, Pinecone for zero-ops. The wrong choice won't kill your project -- but the right choice will save you months of performance tuning and one painful migration. Test with your actual data, your actual query patterns, and your actual filter selectivity before committing.
Leave a Comment