RAG on Databricks: Production Patterns with Mosaic AI Vector Search
Why this matters
You’ve shipped a proof-of-concept chatbot. It works—until real users throw messy PDFs, fast-changing policies, and vague questions at it. The model is fine; the retrieval isn’t. On Databricks, Mosaic AI Vector Search plus solid data engineering turns flaky RAG into a maintainable product with SLAs. This guide shows the patterns that actually hold up in production.
Architecture at a glance
Core pieces
- Data lakehouse (Delta/UC): documents, tables, and metadata with governance in Unity Catalog.
- Ingestion & chunking: pipelines standardize formats, chunk, and embed.
- Mosaic AI Vector Search: managed indexes + endpoints; hybrid search and reranking for quality.
- Model Serving: host your LLM or call external models via endpoints.
- RAG app / agents: query index → rank → ground LLM → log, evaluate, and retrain.
Data flow
1) Land files to a UC Volume → 2) Ingest & normalize to Delta → 3) Chunk + embed → 4) Upsert to the Vector Search index → 5) Query with hybrid retrieval + rerank → 6) Ground the LLM via Model Serving → 7) Evaluate and iterate.
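Step 3 (chunk + embed) is the only step that needs model access before the index upsert. A minimal sketch, assuming the pay-per-token embedding endpoint databricks-gte-large-en is enabled in the workspace and the MLflow Deployments client is used to call it:

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

def embed(texts):
    # Foundation Model APIs accept an OpenAI-style {"input": [...]} payload for embeddings
    resp = client.predict(
        endpoint="databricks-gte-large-en",   # assumed pay-per-token embedding endpoint
        inputs={"input": texts},
    )
    return [row["embedding"] for row in resp["data"]]

vectors = embed(["Lab B requires safety goggles and nitrile gloves at all times."])
# one vector per chunk, ready to write to the Delta table before the index upsert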
Indexing patterns that scale
1) Delta-sourced, auto-synced indexes
Use Delta tables as the source of truth and let Mosaic AI Vector Search sync changes to the index. It keeps ingestion simple and supports near-real-time updates without redeploys.
When to use: steady trickle of documents, frequent small updates.
2) Storage-optimized endpoints for large corpora
If you expect tens of millions of chunks, pick a storage-optimized endpoint and enable hybrid retrieval from day one to avoid expensive rework.
3) Hybrid search by default
Combine dense vectors with keyword (BM25) matching to handle numbers, acronyms, and rare terms. Hybrid search is GA and the pragmatic default.
4) Reranking for precision@k
Turn on reranking (Public Preview) to boost top-k precision without changing chunking. We’ve seen double-digit gains on enterprise sets with minimal plumbing.
Practical chunking & metadata
- Semantic chunking: split by headings or layout, not fixed token counts. Store SECTION_TITLE, DOC_ID, PAGE_RANGE.
- Tight chunks (300–800 tokens): reduce hallucinations; include a breadcrumb (title > h2 > h3) in the chunk text.
- Cross-doc dedupe: hash normalized text to drop near-duplicates before embedding (see the sketch after this list).
- Freshness windows: track EFFECTIVE_FROM/TO to answer “according to the latest policy.”
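A minimal sketch of those conventions, with the heading-based section parsing assumed to happen upstream; the rows line up with the Delta schema below:

import hashlib

def chunk_with_breadcrumbs(doc_id, sections, seen):
    """Yield de-duplicated chunk rows; `sections` is a list of (section_path, text) pairs."""
    for section_path, text in sections:
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest in seen:        # near-verbatim duplicate seen in another document
            continue
        seen.add(digest)
        yield {
            "doc_id": doc_id,
            "section_path": section_path,              # e.g. "Manual > Safety > PPE"
            "chunk_text": f"[{section_path}] {text}",  # breadcrumb baked into the chunk text
        }

seen_hashes = set()
rows = list(chunk_with_breadcrumbs(
    "doc-17", [("Manual > Safety > PPE", "Wear goggles in lab B.")], seen_hashes
))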
Example Delta schema (simplified)
| column | type | notes |
|---|---|---|
| chunk_id | STRING | primary key, unique per chunk |
| doc_id | STRING | stable document ID |
| section_path | STRING | e.g., Manual > Safety > PPE |
| chunk_text | STRING | cleaned text with breadcrumb |
| source_uri | STRING | UC Volume path |
| effective_from | TIMESTAMP | for recency filtering |
| embedding | ARRAY<FLOAT> | dimension set by the embedding model |
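A sketch of the matching DDL, run from a notebook where spark is predefined. Delta Sync indexes require Change Data Feed on the source table, and the embedding column is stored as a plain float array:

spark.sql("""
  CREATE TABLE IF NOT EXISTS uc.catalog.rag_chunks (
    chunk_id        STRING,
    doc_id          STRING,
    section_path    STRING,
    chunk_text      STRING,
    source_uri      STRING,
    effective_from  TIMESTAMP,
    embedding       ARRAY<FLOAT>
  )
  TBLPROPERTIES (delta.enableChangeDataFeed = true)  -- required for Delta Sync indexes
""")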
Creating and querying a Vector Search index
Python: create endpoint & index
# Databricks notebook
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

# 1) Create an endpoint ("STANDARD" or "STORAGE_OPTIMIZED")
vsc.create_endpoint(
    name="vs-prod",
    endpoint_type="STORAGE_OPTIMIZED",  # plan for scale
)

# 2) Create a Delta Sync index sourced from the chunk table
#    (the source table must have Change Data Feed enabled)
vsc.create_delta_sync_index(
    endpoint_name="vs-prod",
    index_name="uc.catalog.rag_chunks_idx",
    source_table_name="uc.catalog.rag_chunks",   # has the 'embedding' column + metadata
    primary_key="chunk_id",
    embedding_vector_column="embedding",         # self-managed embeddings
    embedding_dimension=1024,                    # must match the embedding model
    pipeline_type="TRIGGERED",                   # or "CONTINUOUS"
)
Docs: endpoint & index creation in Mosaic AI Vector Search.
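With pipeline_type="TRIGGERED", the index only updates when a sync is requested; a short follow-up using the same client:

# Request a sync after new chunks land in the Delta table
index = vsc.get_index(endpoint_name="vs-prod", index_name="uc.catalog.rag_chunks_idx")
index.sync()
print(index.describe())  # inspect indexing state before routing traffic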
SQL: query with vector_search()
-- Retrieve k=8 candidates via hybrid retrieval (vectors + keywords)
SELECT *
FROM vector_search(
  index => 'uc.catalog.rag_chunks_idx',
  query_text => 'What are the PPE requirements for lab B?',
  num_results => 8,
  query_type => 'HYBRID'   -- ANN (default) or HYBRID
);
Docs: vector_search() function (Public Preview) for Databricks SQL.
Python: rerank + metadata filter
index = vsc.get_index(endpoint_name="vs-prod", index_name="uc.catalog.rag_chunks_idx")
results = index.similarity_search(
    query_text="What are the PPE requirements for lab B?",
    columns=["doc_id", "section_path", "chunk_text", "effective_from"],
    filters={"effective_from >=": "2025-01-01"},  # dict filters apply to standard endpoints
    num_results=20,
    query_type="HYBRID",
    # Reranking (Public Preview) is enabled with one extra query parameter;
    # check the current Vector Search docs for the exact keyword argument.
)
Reranking in Vector Search (Public Preview) improves retrieval quality with a single parameter.
Orchestrating the RAG pipeline on Databricks
Ingestion → Normalization → Chunking → Embedding → Index upsert → Evals & logging. Use UC Volumes for raw files, Delta Live Tables or Jobs for pipelines, and log all retrievals/answers for offline eval.
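A sketch of the ingestion half as a Delta Live Tables pipeline; the volume path and the parse_and_chunk UDF are placeholders for your own parsing logic:

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw documents landed in the UC Volume")
def bronze_docs():
    return (
        spark.readStream.format("cloudFiles")            # Auto Loader
        .option("cloudFiles.format", "binaryFile")
        .load("/Volumes/uc/catalog/raw_docs/")           # placeholder volume path
    )

@dlt.table(comment="Parsed, chunked rows ready for embedding and index sync")
def silver_chunks():
    # parse_and_chunk is a placeholder UDF wrapping the chunking logic shown earlier;
    # it returns an array of structs matching the Delta schema above.
    return (
        dlt.read_stream("bronze_docs")
        .select(F.explode(F.expr("parse_and_chunk(content, path)")).alias("chunk"))
        .select("chunk.*")
    )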
Model Serving hosts your LLM (open-source or custom) or connects to external providers through external model endpoints. Keep grounding strict: pass only top-k chunks, their citations, and metadata.
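A sketch of that grounding call through the MLflow Deployments client; the chat endpoint name is a placeholder for whichever hosted or external model you serve:

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
question = "What are the PPE requirements for lab B?"
context = "[doc-17 | Manual > Safety > PPE] Lab B requires goggles and nitrile gloves."  # top-k chunks + citations

response = client.predict(
    endpoint="databricks-meta-llama-3-3-70b-instruct",   # placeholder chat endpoint name
    inputs={
        "messages": [
            {"role": "system",
             "content": "Answer only from CONTEXT and cite [doc_id | section_path]."},
            {"role": "user", "content": f"QUESTION: {question}\n\nCONTEXT:\n{context}"},
        ],
        "max_tokens": 512,
    },
)
answer = response["choices"][0]["message"]["content"]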
Retrieval strategies that work
- Two-stage retrieval: HYBRID (k=50) → rerank to top-10 → LLM context (k=5); see the helper after this list.
- Time-aware recall: filter by effective_from or the last N days; helpful for policy/data versioning.
- Fielded filters: product line, region, version, kept as index metadata to avoid post-filtering drops.
- Answer synthesis: instruct the LLM to quote sources with doc_id and section_path.
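The two-stage funnel as a small helper around similarity_search; when reranking is enabled, candidates come back already reordered, so the final slice is the precision stage fed to the prompt:

def retrieve(index, question, k_candidates=50, k_context=5):
    """Wide hybrid recall, then keep only the rows that go into the prompt."""
    hits = index.similarity_search(
        query_text=question,
        columns=["doc_id", "section_path", "chunk_text"],
        num_results=k_candidates,    # recall stage
        query_type="HYBRID",
    )
    return hits["result"]["data_array"][:k_context]   # precision stage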
Evaluation & observability (don’t skip)
- Golden sets: curated Q/A with references; track exact-match and faithfulness.
- Retrieval metrics: recall@k, MRR, rerank deltas (sketched below); alert when recall slips.
- Drift monitors: new document distributions, chunk-length stats, embedding norms.
- Human review loop: sample low-confidence answers; write back labels for training.
Databricks’ RAG guidance and cookbook materials are good starting points for eval workflows.
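A minimal sketch of the retrieval metrics against a golden set, in pure Python so it can run inside any eval job:

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant doc_ids that appear in the top-k retrieved list."""
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / max(len(relevant_ids), 1)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant hit (0.0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

golden = [{"question": "PPE for lab B?", "relevant": ["doc-17"]}]  # curated examples
# For each example: run retrieval, collect its doc_ids, then track recall_at_k / mrr over time.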
Security & governance
- Unity Catalog controls who can read raw files, tables, and vector indexes.
- PII hygiene: redact before chunking; embargo sensitive classes from retrieval.
- Data residency: align serving endpoints and storage to the right geo. Databricks notes residency for foundation model APIs and serving.
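A sketch of the corresponding Unity Catalog grants, run from a notebook where spark is predefined; the group name is a placeholder:

spark.sql("GRANT READ VOLUME ON VOLUME uc.catalog.raw_docs TO `rag-readers`")
spark.sql("GRANT SELECT ON TABLE uc.catalog.rag_chunks TO `rag-readers`")
# Vector Search indexes are registered in Unity Catalog as well; grant index access
# only to the principal that runs the RAG app.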
Common pitfalls (and fixes)
- Oversized chunks → context bloat, hallucinations. Fix: 300–800 tokens, semantic splits.
- Vector-only search → misses rare terms and IDs. Fix: default to HYBRID.
- No rerank → on-topic but wrong top-1. Fix: enable reranking.
- Index drift → stale answers. Fix: Delta-synced indexes plus CI tests to confirm coverage.
- Opaque answers → trust issues. Fix: show citations with doc_id and section_path.
Reference implementation (end-to-end snippet)
# 1) Ingest → chunk → embed → write to Delta
#    columns: chunk_id, doc_id, section_path, chunk_text, effective_from, embedding
# 2) Ensure the index is synced to uc.catalog.rag_chunks_idx (see earlier)
# 3) Query + compose the grounded prompt
q = "Summarize PPE requirements for lab B and cite the source."
index = vsc.get_index(endpoint_name="vs-prod", index_name="uc.catalog.rag_chunks_idx")
hits = index.similarity_search(
    query_text=q,
    columns=["doc_id", "section_path", "chunk_text"],
    filters={"section_path LIKE": "Safety"},   # token match on the breadcrumb (standard endpoints)
    num_results=32,
    query_type="HYBRID",                       # reranking (Public Preview) can be enabled here too
)
rows = hits["result"]["data_array"]            # rows follow the order of `columns`
context = "\n\n".join(
    f"[{doc_id} | {section_path}] {chunk_text}"
    for doc_id, section_path, chunk_text, *_ in rows[:6]
)
prompt = f"""You are a compliance assistant.
Use only CONTEXT to answer and cite [doc_id | section_path].
QUESTION: {q}
CONTEXT:
{context}
"""
# 4) Call the Model Serving endpoint with the prompt (see the grounding sketch above)
Production checklist (copy/paste)
- UC Volumes for raw files; Delta bronze/silver tables.
- Semantic chunking + breadcrumbs; dedupe.
- Delta-synced Mosaic AI Vector Search index.
- HYBRID retrieval; reranking enabled.
- Retrieval logging; golden sets; eval jobs.
- Model Serving with strict grounding + citations.
Internal link ideas
- Designing semantic chunking pipelines on Databricks
- Choosing embedding models and dimensionality for enterprise RAG
- Evaluating RAG with golden sets and automated drift checks
- Governance patterns with Unity Catalog for vector indexes
- Cost-aware scaling: standard vs storage-optimized endpoints
Summary & CTA
RAG fails when retrieval fails. On Databricks, you stabilize it with Delta-sourced, auto-synced indexes, HYBRID retrieval, and reranking, stitched together by well-governed pipelines and Model Serving. Start with the checklist above, ship a thin slice, and instrument the heck out of it. Your LLM will look smarter because your retrieval is.
Image prompt
“A clean, modern data architecture diagram illustrating a Databricks RAG pipeline: UC Volumes/Delta ingestion, chunking/embeddings, Mosaic AI Vector Search with hybrid search + reranking, Model Serving, and evaluation loop — minimalistic, high contrast, 3D isometric style.”
Tags
#RAG #Databricks #MosaicAI #VectorSearch #LLMOps #UnityCatalog #DataEngineering #HybridSearch #ModelServing