PostgreSQL as a Vector Database: pgvector Is All You Need

Key Takeaways

  • pgvector turns PostgreSQL into a capable vector database that handles millions of embeddings with sub-50ms query latency when properly indexed.
  • HNSW indexes are the default choice for most workloads, offering better recall and query speed than IVFFlat at the cost of slower index builds and more memory.
  • Hybrid search combining vector similarity with full-text search is where pgvector truly shines: few dedicated vector databases offer it natively in a single query alongside relational joins and filters.
  • For datasets under ~10 million vectors, pgvector on a properly tuned PostgreSQL instance matches or beats dedicated vector databases on latency while costing significantly less.
  • You probably don't need a dedicated vector database if you're already running PostgreSQL. The operational simplicity of one fewer service in production is worth more than marginal latency gains.

Why I Ditched Pinecone (and Stopped Overthinking Vector Search)

Fourteen months ago, I was running a RAG pipeline for an internal knowledge base at my company. The architecture looked like what every tutorial recommends: OpenAI embeddings, Pinecone for vector storage and search, PostgreSQL for everything else — user data, document metadata, audit logs, access control. Two databases, two sets of credentials, two failure modes, two billing dashboards.

It worked. But every time I shipped a new feature, I felt the friction. Adding a new document meant writing to both Pinecone and PostgreSQL in a transaction-like flow that wasn't actually transactional. Keeping them in sync required a reconciliation job that ran every six hours. When Pinecone had a brief outage in March, our search went down while the rest of the app kept running perfectly — because of course, our "real" database was fine.

The breaking point was the bill. We were storing about 2 million vectors (1536 dimensions, OpenAI ada-002) and doing maybe 50,000 queries per day. Pinecone was charging us around $70/month on the Starter plan, which isn't much in absolute terms, but the number that bothered me was the operational cost: roughly 15% of my infrastructure time was spent on Pinecone-related sync issues, debugging, and monitoring. For a feature that PostgreSQL could handle natively.

I spent a Saturday migrating everything to pgvector. By Sunday evening, the entire RAG pipeline was running on a single PostgreSQL 16 instance. Query latency went from ~45ms (Pinecone) to ~35ms (pgvector with HNSW). The sync job was deleted. The reconciliation cron was deleted. One database, one source of truth, one bill.

This is the guide I wish I had found before that Saturday.

Setting Up pgvector: From Zero to Vector Search in 10 Minutes

pgvector is a PostgreSQL extension, not a fork. This matters because you keep everything — your existing tables, indexes, extensions, connection pooling, backups, replication. You're just adding a new data type and a few operators.

Installation

On Ubuntu/Debian with PostgreSQL 16:

# Install the prebuilt package
sudo apt install postgresql-16-pgvector

# Or build from source (pin a release tag)
cd /tmp
git clone --branch v0.7.4 https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install

Then enable it in your database:

CREATE EXTENSION IF NOT EXISTS vector;

That's it. No separate service, no Docker container, no cluster configuration, no API keys. If you're using a managed PostgreSQL service (RDS, Cloud SQL, Supabase, Neon), most of them support pgvector out of the box — just run the CREATE EXTENSION statement.

Creating Your First Vector Table

Here's a real-world schema for a document search system. Notice how the vector column lives alongside normal relational data:

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    source_url TEXT,
    category_id INTEGER REFERENCES categories(id),
    embedding vector(1536),  -- OpenAI ada-002 dimension
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Full-text search support (we'll combine this with vector search later)
ALTER TABLE documents
    ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || content)) STORED;

CREATE INDEX idx_documents_content_tsv ON documents USING gin(content_tsv);

The vector(1536) type stores a fixed-dimension float array. The dimension must match your embedding model: 1536 for OpenAI ada-002/text-embedding-3-small, 3072 for text-embedding-3-large, 768 for many open-source models like BGE or E5, 384 for all-MiniLM-L6-v2.
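A dimension mismatch between your model and the vector(N) column only surfaces as an error at insert time, so it's worth failing fast in application code. Here is a small sketch (the helper name and dict are my own, not part of pgvector):

```python
# Embedding dimensions per model, matching the vector(N) column declaration.
EMBEDDING_DIMS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "all-MiniLM-L6-v2": 384,
}

def check_dimension(embedding: list[float], model: str) -> list[float]:
    """Fail fast if an embedding doesn't match the declared column dimension."""
    expected = EMBEDDING_DIMS[model]
    if len(embedding) != expected:
        raise ValueError(
            f"{model} should produce {expected}-dim vectors, got {len(embedding)}"
        )
    return embedding
```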

Inserting Embeddings from Python

Here's how I actually insert documents with embeddings in production. I use psycopg (v3) with pgvector's Python integration:

import psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Get embedding from OpenAI API."""
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

def insert_document(conn, title: str, content: str, source_url: str | None = None):
    """Insert a document with its embedding."""
    # Combine title and content for embedding (weighted toward title)
    embed_text = f"{title}\n\n{content[:8000]}"  # Truncate for token limits
    embedding = get_embedding(embed_text)

    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO documents (title, content, source_url, embedding)
            VALUES (%s, %s, %s, %s::vector)
            RETURNING id
            """,
            (title, content, source_url, embedding)
        )
        doc_id = cur.fetchone()[0]
        conn.commit()
        return doc_id

# Usage
conn = psycopg.connect("postgresql://user:pass@localhost/mydb")
register_vector(conn)

insert_document(conn, "How to tune PostgreSQL for analytics", "PostgreSQL can be ...")

For batch inserts, use executemany or COPY — inserting embeddings one by one over the network is painfully slow. I typically batch 500-1000 documents at a time:

from psycopg.rows import namedtuple_row

def batch_insert_documents(conn, documents: list[dict]):
    """Batch insert documents with embeddings using COPY."""
    # Pre-compute all embeddings (batch API call is cheaper)
    texts = [f"{d['title']}\n\n{d['content'][:8000]}" for d in documents]
    response = client.embeddings.create(input=texts, model="text-embedding-3-small")
    embeddings = [item.embedding for item in response.data]

    with conn.cursor() as cur:
        with cur.copy(
            "COPY documents (title, content, source_url, embedding) FROM STDIN"
        ) as copy:
            for doc, emb in zip(documents, embeddings):
                copy.write_row((
                    doc['title'],
                    doc['content'],
                    doc.get('source_url'),
                    str(emb)  # pgvector accepts string representation
                ))
    conn.commit()
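To actually enforce the 500-1000 batch size (and stay under the embedding API's per-request limits), I slice the input with a small chunking helper before calling the function above; `chunked` is my own name, not a library function:

```python
from collections.abc import Iterator

def chunked(items: list, size: int = 500) -> Iterator[list]:
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage sketch:
# for batch in chunked(documents, 500):
#     batch_insert_documents(conn, batch)
```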

Indexing Strategies: IVFFlat vs HNSW (and When Each Wins)

Without an index, pgvector does exact nearest-neighbor search — it scans every single vector in the table and computes the distance. This is perfectly fine for small tables (under ~50,000 rows in my experience), but it doesn't scale. At 1 million rows with 1536 dimensions, an exact scan takes 800ms-1.2 seconds. That's not usable for a production API.

pgvector offers two approximate nearest-neighbor (ANN) index types. Choosing between them is the most important performance decision you'll make.

IVFFlat: The Fast-to-Build Option

IVFFlat (Inverted File with Flat quantization) partitions your vectors into clusters using k-means, then searches only the closest clusters at query time.

-- Create IVFFlat index
-- lists = number of clusters
-- (rule of thumb: rows/1000 up to 1M rows, sqrt(rows) beyond that)
CREATE INDEX idx_documents_embedding_ivfflat
    ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000);

-- At query time, control how many clusters to search (more = better recall, slower)
SET ivfflat.probes = 20;  -- default is 1, I typically use 10-30
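The lists heuristic from pgvector's documentation (rows/1000 up to 1M rows, sqrt(rows) beyond that) fits in a tiny helper; the function name is my own:

```python
import math

def ivfflat_lists(num_rows: int) -> int:
    """Suggested `lists` value for an IVFFlat index (pgvector docs heuristic)."""
    if num_rows <= 1_000_000:
        return max(1, num_rows // 1000)
    return round(math.sqrt(num_rows))
```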

Pros: Faster to build (2-5x faster than HNSW), less memory during build, good enough recall for many use cases.

Cons: Lower recall at the same query speed compared to HNSW. Needs rebuilding when data distribution shifts significantly. The probes parameter requires tuning — too low and you miss relevant results, too high and you lose the speed benefit.

When to use IVFFlat: When your data changes frequently and you need to rebuild indexes often, or when you're on a memory-constrained instance and can't afford HNSW's build-time RAM usage.

HNSW: The Default Choice for Production

HNSW (Hierarchical Navigable Small World) builds a multi-layer graph where each vector is connected to its approximate neighbors. Queries traverse this graph from top to bottom, narrowing in on the nearest neighbors efficiently.

-- Create HNSW index
-- m = max connections per node (default 16, higher = better recall, more memory)
-- ef_construction = build-time search width (default 64, higher = better index quality)
CREATE INDEX idx_documents_embedding_hnsw
    ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 128);

-- At query time, control search width (more = better recall, slower)
SET hnsw.ef_search = 100;  -- default is 40, I use 100-200 for production

Pros: Better recall at the same query speed. No need to rebuild when adding data (it's incrementally updated). More predictable performance.

Cons: Slower to build (a 2M row table takes ~20 minutes vs ~8 minutes for IVFFlat on my machine). Uses significantly more memory during build. Larger index on disk.

When to use HNSW: Almost always. It's my default. Unless you have a specific reason to choose IVFFlat, start here.

Distance Operators

pgvector supports three distance functions, each with its own operator and index ops class:

| Distance | Operator | Index Ops Class | When to Use |
|---|---|---|---|
| Cosine | <=> | vector_cosine_ops | Normalized embeddings (OpenAI, Cohere). Most common choice. |
| L2 (Euclidean) | <-> | vector_l2_ops | When magnitude matters (image features, some scientific data). |
| Inner Product | <#> | vector_ip_ops | Pre-normalized vectors where you want maximum similarity (the operator returns the negative inner product). |

If you're using OpenAI or Cohere embeddings, use cosine distance. The embeddings are already normalized, so cosine and inner product give the same ranking, but cosine is more intuitive (0 = identical, 2 = opposite).
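You can sanity-check the "same ranking for normalized vectors" claim in plain Python, without touching the database. This sketch mimics pgvector's <=> (cosine distance) and <#> (negative inner product) operators:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    # Mirrors pgvector's <=> operator: 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def neg_inner_product(a: list[float], b: list[float]) -> float:
    # Mirrors pgvector's <#> operator: negative inner product
    return -sum(x * y for x, y in zip(a, b))

def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

query = normalize([0.2, 0.5, 0.8])
docs = [normalize(v) for v in ([1.0, 0.1, 0.0], [0.1, 0.6, 0.9], [0.5, 0.5, 0.1])]

# Ascending distance order in both metrics
by_cosine = sorted(range(3), key=lambda i: cosine_distance(query, docs[i]))
by_ip = sorted(range(3), key=lambda i: neg_inner_product(query, docs[i]))
```

For unit-norm vectors the two orderings always agree, which is why the choice between them is about interpretability, not ranking quality.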

Performance Tuning: Getting pgvector Below 50ms

Out of the box, pgvector performance is mediocre. With tuning, it's competitive with dedicated vector databases. Here's what actually moved the needle for me.

PostgreSQL Memory Settings

-- These go in postgresql.conf or ALTER SYSTEM SET

-- Give PostgreSQL more memory for caching index pages
-- For a dedicated vector search instance, I set this to 40-50% of RAM
ALTER SYSTEM SET shared_buffers = '4GB';

-- Per-operation work memory; note this applies per sort/hash per query,
-- so a global 256MB is only safe with a bounded connection pool
ALTER SYSTEM SET work_mem = '256MB';

-- Effective cache size should reflect total available memory
ALTER SYSTEM SET effective_cache_size = '12GB';

-- Maintenance work memory for index builds
ALTER SYSTEM SET maintenance_work_mem = '2GB';

-- Reload picks up work_mem and effective_cache_size;
-- shared_buffers only takes effect after a full server restart
SELECT pg_reload_conf();

Keep Hot Indexes in Memory

The single biggest performance improvement came from ensuring the HNSW index fits in shared_buffers. For 2 million vectors at 1536 dimensions, the HNSW index is roughly 15-18 GB. On a 32 GB instance with shared_buffers = 12GB, the index is partially cached. On a 64 GB instance with shared_buffers = 24GB, it fits entirely in memory, and query latency drops from ~40ms to ~12ms.
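A back-of-the-envelope sizing helper makes capacity planning easier. The raw vector data is rows × dims × 4 bytes (float4), and in my deployments the HNSW graph plus page overhead adds roughly 30-50% on top, which is consistent with the 15-18 GB figure above. The function and the overhead factor are my own estimates, not pgvector formulas:

```python
def estimate_hnsw_size_gb(rows: int, dims: int, overhead: float = 1.4) -> float:
    """Very rough HNSW index size estimate in GiB.

    rows * dims * 4 bytes of raw float4 data, multiplied by an assumed
    graph/page overhead factor (1.3-1.5 in my experience).
    """
    raw_bytes = rows * dims * 4
    return raw_bytes * overhead / (1024 ** 3)
```

For 2M vectors at 1536 dimensions this lands around 16 GiB, so you know up front whether a given shared_buffers setting can hold the index.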

Check how much of your index is cached:

-- Install pg_buffercache extension
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Check index cache hit ratio
SELECT
    c.relname AS index_name,
    pg_size_pretty(pg_relation_size(c.oid)) AS index_size,
    count(*) AS buffers_cached,
    pg_size_pretty(count(*) * 8192) AS cached_size,
    round(100.0 * count(*) / (pg_relation_size(c.oid) / 8192), 1) AS pct_cached
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE c.relname LIKE '%embedding%'
GROUP BY c.relname, c.oid;

Connection Pooling Matters More Than You Think

Every pgvector query allocates memory proportional to the ef_search parameter and the vector dimension. With 100 concurrent connections each running HNSW searches, you can spike memory usage by several GB. I use PgBouncer in transaction mode with a pool of 20-30 connections, which keeps memory stable and actually improves throughput under load because PostgreSQL handles 20 concurrent HNSW searches much better than 100.

Hybrid Search: Where pgvector Truly Outshines Dedicated Vector DBs

This is the killer feature. In Pinecone, if I wanted to combine semantic similarity with keyword matching, I'd need to run two queries (one in Pinecone, one in PostgreSQL) and merge the results in application code. With pgvector, it's a single SQL query:

-- Hybrid search: combine vector similarity with full-text search
-- This is the query I use in production for our RAG pipeline

WITH semantic AS (
    -- Vector similarity search, with rank within this result set
    SELECT
        id,
        title,
        content,
        1 - (embedding <=> $1::vector) AS semantic_score,
        ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rank
    FROM documents
    WHERE category_id = ANY($3)  -- Pre-filter by category
    ORDER BY embedding <=> $1::vector
    LIMIT 40
),
keyword AS (
    -- Full-text search, with rank within this result set
    SELECT
        id,
        title,
        content,
        ts_rank_cd(content_tsv, websearch_to_tsquery('english', $2)) AS keyword_score,
        ROW_NUMBER() OVER (
            ORDER BY ts_rank_cd(content_tsv, websearch_to_tsquery('english', $2)) DESC
        ) AS rank
    FROM documents
    WHERE content_tsv @@ websearch_to_tsquery('english', $2)
      AND category_id = ANY($3)
    ORDER BY rank
    LIMIT 40
)
-- Reciprocal Rank Fusion (RRF) to combine both result sets
SELECT
    COALESCE(s.id, k.id) AS id,
    COALESCE(s.title, k.title) AS title,
    COALESCE(s.content, k.content) AS content,
    COALESCE(s.semantic_score, 0) AS semantic_score,
    COALESCE(k.keyword_score, 0) AS keyword_score,
    -- RRF formula: 1/(k+rank), k=60 is standard. A document missing from
    -- one list contributes 0 from that list.
    COALESCE(1.0 / (60 + s.rank), 0) + COALESCE(1.0 / (60 + k.rank), 0) AS rrf_score
FROM semantic s
FULL OUTER JOIN keyword k ON s.id = k.id
ORDER BY rrf_score DESC
LIMIT 10;

This pattern — Reciprocal Rank Fusion over semantic and keyword results — consistently outperforms either approach alone in my testing. Semantic search catches paraphrases and conceptual matches; keyword search catches exact terms, proper nouns, and code identifiers that embedding models often fumble.
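The fusion step is easy to sanity-check outside the database. Here is RRF in plain Python over two ranked lists of document ids (a sketch of the ranking logic, not the SQL above):

```python
def rrf_fuse(semantic_ids: list[int], keyword_ids: list[int], k: int = 60) -> list[int]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[int, float] = {}
    for ranked in (semantic_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both lists accumulates two reciprocal-rank terms, so it outranks documents that score well in only one list; that is the whole trick.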

Python Implementation for RAG

Here's the complete search function I use in production:

import psycopg
from psycopg.rows import namedtuple_row
from pgvector.psycopg import register_vector
from dataclasses import dataclass

@dataclass
class SearchResult:
    id: int
    title: str
    content: str
    semantic_score: float
    keyword_score: float
    hybrid_score: float

def hybrid_search(
    conn,
    query: str,
    category_ids: list[int] | None = None,
    limit: int = 10,
    semantic_weight: float = 0.7,
) -> list[SearchResult]:
    """
    Hybrid vector + full-text search with weighted score fusion.

    Note: unlike the SQL above, this variant blends the two scores with a
    weight rather than RRF. The raw scores live on different scales, so
    treat semantic_weight as a tuning knob, not a calibrated probability.

    Args:
        query: User's search query (natural language)
        category_ids: Optional filter by categories
        limit: Number of results to return
        semantic_weight: Weight for semantic vs keyword (0-1)
    """
    # Get query embedding
    query_embedding = get_embedding(query)

    sql = """
    WITH semantic AS (
        SELECT id, title, content,
               1 - (embedding <=> %(embedding)s::vector) AS score
        FROM documents
        WHERE (%(cats)s::int[] IS NULL OR category_id = ANY(%(cats)s::int[]))
        ORDER BY embedding <=> %(embedding)s::vector
        LIMIT %(pool)s
    ),
    keyword AS (
        SELECT id, title, content,
               ts_rank_cd(content_tsv, websearch_to_tsquery('english', %(query)s)) AS score
        FROM documents
        WHERE content_tsv @@ websearch_to_tsquery('english', %(query)s)
          AND (%(cats)s::int[] IS NULL OR category_id = ANY(%(cats)s::int[]))
        ORDER BY score DESC
        LIMIT %(pool)s
    ),
    combined AS (
        SELECT
            COALESCE(s.id, k.id) AS id,
            COALESCE(s.title, k.title) AS title,
            COALESCE(s.content, k.content) AS content,
            COALESCE(s.score, 0) AS semantic_score,
            COALESCE(k.score, 0) AS keyword_score
        FROM semantic s
        FULL OUTER JOIN keyword k ON s.id = k.id
    )
    SELECT *,
           %(sw)s * semantic_score + (1 - %(sw)s) * keyword_score AS hybrid_score
    FROM combined
    ORDER BY hybrid_score DESC
    LIMIT %(limit)s
    """

    with conn.cursor(row_factory=namedtuple_row) as cur:
        cur.execute(sql, {
            "embedding": query_embedding,
            "query": query,
            "cats": category_ids,
            "pool": limit * 4,  # Oversample for better fusion
            "sw": semantic_weight,
            "limit": limit,
        })
        return [SearchResult(**row._asdict()) for row in cur.fetchall()]

Benchmark: pgvector vs Dedicated Vector Databases

I ran these benchmarks on a real workload: 2 million document embeddings (1536 dimensions, OpenAI text-embedding-3-small), querying with 1,000 random queries at various concurrency levels. The PostgreSQL instance was running on a 16 vCPU, 64 GB RAM machine (comparable to what I'd use in production). Dedicated vector DBs were on their recommended tier for this dataset size.

| Metric | pgvector (HNSW) | Pinecone (s1.x1) | Weaviate (8 vCPU) | Qdrant (8 vCPU) |
|---|---|---|---|---|
| p50 latency (1 client) | 12 ms | 18 ms | 9 ms | 8 ms |
| p99 latency (1 client) | 35 ms | 65 ms | 22 ms | 19 ms |
| p50 latency (20 clients) | 28 ms | 24 ms | 18 ms | 14 ms |
| p99 latency (20 clients) | 85 ms | 120 ms | 55 ms | 42 ms |
| Recall@10 (ef_search=200) | 0.98 | 0.95 | 0.97 | 0.99 |
| Index build time | 22 min | ~8 min (upsert) | ~15 min | ~12 min |
| Monthly cost | ~$0 (existing PG) | $70-350 | ~$180 (self-host) | ~$150 (self-host) |
| Hybrid search | Native (SQL) | Metadata filter only | BM25 module | Payload filter only |
| ACID transactions | Yes | No | No | No |
| Ops overhead | None (existing PG) | Low (managed) | Medium (self-host) | Medium (self-host) |

The numbers tell a clear story: Qdrant and Weaviate are faster at pure vector search, especially under high concurrency. Pinecone is the easiest to set up but the most expensive and has the worst p99 latency (network round-trips to a managed service). pgvector is in the middle on raw latency but wins on cost, hybrid search capability, and operational simplicity.

When pgvector Is Enough (and When It's Not)

After running pgvector in production for over a year across three different projects, I've developed a pretty clear mental model for when it's the right choice and when you should reach for something else.

pgvector Is Enough When:

  • You're already running PostgreSQL. The operational cost of adding another database to your stack is real. Every new service means new monitoring, new backups, new failure modes, new on-call runbooks. If PostgreSQL is your primary database, pgvector is free incremental complexity.
  • Your dataset is under 10 million vectors. At this scale, pgvector on a reasonable instance (16+ GB RAM, NVMe storage) provides sub-100ms latency with HNSW indexes. Most applications live in this range.
  • You need hybrid search. Combining vector similarity with keyword matching, metadata filtering, or relational joins in a single query is pgvector's superpower. No other solution does this as cleanly.
  • You need ACID guarantees. When your vector data must be consistent with relational data (e.g., deleting a user should delete their embeddings), transactions matter. Dedicated vector DBs don't give you cross-table transactions with your application data.
  • Your team knows SQL. The learning curve for pgvector is approximately zero for any team that already knows PostgreSQL. The learning curve for Weaviate's GraphQL API or Qdrant's gRPC interface is not zero.

Consider a Dedicated Vector DB When:

  • You have 50+ million vectors. At this scale, pgvector's single-node architecture becomes a bottleneck. Qdrant and Weaviate support horizontal sharding natively. Pinecone handles it transparently. PostgreSQL has no native vector sharding, and distributing ANN indexes with Citus comes with significant caveats.
  • You need sub-10ms p99 latency at high concurrency. If you're building a real-time recommendation engine serving 1,000+ QPS, the dedicated vector DBs' purpose-built query engines will outperform pgvector.
  • You're doing multi-modal search across billions of vectors. Think image search at Pinterest/Google scale. This is not pgvector's arena.
  • Your team has no PostgreSQL experience and you're starting from scratch. Pinecone's managed API or Weaviate's Docker setup might genuinely be faster to get started with.

Production Checklist: Deploying pgvector for Real

Here's the checklist I use when deploying pgvector-backed search to production. I've learned each of these from a specific production incident or performance regression:

  1. Use HNSW indexes with ef_construction = 128 and m = 16 as your starting point. Only increase these if recall testing shows you need better accuracy.
  2. Set hnsw.ef_search per-session, not globally. Your search API endpoint might want ef_search = 200, but your nightly batch job that pre-computes related documents is fine with ef_search = 40.
  3. Monitor index size vs shared_buffers. If your HNSW index doesn't fit in memory, query latency will spike unpredictably as pages get evicted and reloaded from disk.
  4. Use connection pooling. Each concurrent HNSW search allocates working memory. PgBouncer in transaction mode with 20-30 connections is a good default for a 16 vCPU instance.
  5. Build indexes with maintenance_work_mem cranked up. Set it to 4-8 GB before an index build, then reset. The build is ~3x faster with more memory.
  6. Test recall, not just latency. Build a ground-truth set by running exact search on a sample of queries (force a sequential scan with SET enable_indexscan = off, or query before the index exists), then measure recall@10 of your ANN results against it. Aim for 0.95+ recall.
  7. Vacuum and analyze after large batch inserts. HNSW indexes handle incremental inserts well, but the PostgreSQL planner needs updated statistics to choose the right query plan.
  8. Pre-filter before vector search when possible. A WHERE category_id = 5 clause can hit a partial HNSW index built for that predicate; without one, the filter is applied to the ANN candidates after the index scan, which is slower and can silently return fewer rows than your LIMIT.
-- Partial HNSW index for a specific high-traffic category
CREATE INDEX idx_docs_embedding_tech
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 128)
    WHERE category_id = 5;  -- 'Technology' category
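The recall measurement in item 6 boils down to set overlap between the exact top-k and the ANN top-k. A minimal helper (the name is mine):

```python
def recall_at_k(exact_ids: list[int], ann_ids: list[int], k: int = 10) -> float:
    """Fraction of the true top-k (from exact search) that the ANN search returned."""
    truth = set(exact_ids[:k])
    found = set(ann_ids[:k])
    return len(truth & found) / len(truth)
```

Average this over a few hundred sampled queries; if the mean drops below ~0.95, raise ef_search (or rebuild with higher m / ef_construction) before touching anything else.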

The Bigger Picture: Stop Overengineering Your Vector Search

The vector database market has exploded over the past two years. Every month there's a new startup, a new benchmark, a new claim about trillion-scale search. And most of it is irrelevant to the vast majority of developers building AI-powered applications.

If you're building a RAG chatbot, a semantic search feature, a recommendation engine, or a duplicate detection system — and your dataset fits on a single machine — pgvector is almost certainly enough. You get vector search, full-text search, relational queries, ACID transactions, battle-tested backup/replication, and decades of PostgreSQL ecosystem tooling. In one database. That you're probably already running.

The dedicated vector databases are excellent engineering. Qdrant's performance is genuinely impressive. Weaviate's module ecosystem is well-designed. Pinecone's managed experience is smooth. But they're solving a scale problem that most of us don't have. And adding a new database to your stack to solve a problem you don't have is the kind of overengineering that comes back to haunt you at 3 AM when the sync job fails and your search results are stale.

Install pgvector. Create an HNSW index. Write a hybrid search query. Ship it. You can always migrate to something fancier later if you actually hit the limits — but I bet you won't.

Update (December 2025): recent pgvector releases keep narrowing the gap with dedicated vector databases: 0.7.0 added half-precision (halfvec) and binary vector types that can cut memory use by roughly 2-32x, and 0.8.0 added iterative index scans that improve filtered queries.
