
RAG on Databricks: Production Patterns with Mosaic AI Vector Search


Why this matters

You’ve shipped a proof-of-concept chatbot. It works—until real users throw messy PDFs, fast-changing policies, and vague questions at it. The model is fine; the retrieval isn’t. On Databricks, Mosaic AI Vector Search plus solid data engineering turns flaky RAG into a maintainable product with SLAs. This guide shows the patterns that actually hold up in production.


Architecture at a glance

Core pieces

  • UC Volumes + Delta tables: landing zone and source of truth
  • Mosaic AI Vector Search: endpoints + Delta-synced indexes
  • Model Serving: hosts the LLM or brokers external providers
  • Unity Catalog: governance over files, tables, and indexes

Data flow

  1. Land files to a UC Volume → 2. Ingest & normalize to Delta → 3. Chunk + embed → 4. Upsert to Vector Index → 5. Query with hybrid + rerank → 6. Ground LLM via Model Serving → 7. Evaluate and iterate.

Indexing patterns that scale

1) Delta-sourced, auto-synced indexes

Use Delta tables as the source of truth and let Mosaic AI Vector Search sync changes to the index. It keeps ingestion simple and supports near-real-time updates without redeploys.

When to use: steady trickle of documents, frequent small updates.

2) Storage-optimized endpoints for large corpora

If you expect tens of millions of chunks, pick a storage-optimized endpoint and enable hybrid retrieval from day one to avoid expensive rework.

3) Hybrid search by default

Combine dense vectors + keyword (BM25) to handle numbers, acronyms, and rare terms. Hybrid search is GA and the pragmatic default.
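
Conceptually, hybrid retrieval fuses a dense (semantic) ranking and a sparse (keyword) ranking into a single list. Vector Search does this server-side, so the sketch below is purely illustrative — it shows reciprocal rank fusion, one common fusion scheme, over two toy ranked lists:

```python
# Illustrative reciprocal rank fusion (RRF) of two ranked lists of doc IDs.
# Vector Search performs its own fusion internally; this just shows the idea.

def rrf(dense_ranked, sparse_ranked, k=60):
    """Fuse two rankings; earlier ranks contribute larger reciprocal scores."""
    scores = {}
    for ranking in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]    # semantic-similarity order
sparse = ["d7", "d9", "d3"]   # BM25 order (catches IDs and acronyms)
print(rrf(dense, sparse))     # docs present in both lists float to the top
```

Documents that score well in either modality survive, which is exactly why hybrid handles rare terms that embeddings alone miss.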

4) Reranking for precision@k

Turn on reranking (Public Preview) to boost top-k precision without changing chunking. We’ve seen double-digit gains on enterprise sets with minimal plumbing.


Practical chunking & metadata

  • Semantic chunking: split by headings or layout, not fixed token counts. Store SECTION_TITLE, DOC_ID, PAGE_RANGE.
  • Tight chunks (300–800 tokens): reduce hallucinations; include a breadcrumb (title > h2 > h3) in the chunk text.
  • Cross-doc dedupe: hash normalized text to drop near-duplicates before embedding.
  • Freshness windows: track EFFECTIVE_FROM/TO to answer “according to the latest policy.”
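
The bullets above can be sketched as one small pre-processing step. A minimal, illustrative version (heading-based splitting plus hash dedupe; the `chunk_markdown` helper and sample document are hypothetical, not a Databricks API):

```python
import hashlib
import re

def chunk_markdown(doc_id, text):
    """Split on markdown headings, prepend a breadcrumb, dedupe by normalized hash."""
    seen, chunks, breadcrumb = set(), [], []
    for block in re.split(r"\n(?=#+ )", text):
        m = re.match(r"(#+) (.+)", block)
        if m:
            level = len(m.group(1))
            breadcrumb = breadcrumb[: level - 1] + [m.group(2).strip()]
        body = re.sub(r"^#+ .+\n?", "", block).strip()
        if not body:
            continue
        # Cross-doc dedupe: hash whitespace-normalized, lowercased text
        digest = hashlib.sha256(" ".join(body.split()).lower().encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        chunks.append({
            "doc_id": doc_id,
            "section_path": " > ".join(breadcrumb),
            "chunk_text": f"{' > '.join(breadcrumb)}\n{body}",  # breadcrumb in text
        })
    return chunks

doc = "# Manual\n## Safety\nWear PPE in lab B.\n## Safety2\nWear PPE in lab B."
for c in chunk_markdown("doc-1", doc):
    print(c["section_path"])  # the duplicate section is dropped by the hash check
```

Real pipelines would add token-length limits and layout-aware parsing, but the shape — split, breadcrumb, hash, skip — is the same.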

Example Delta schema (simplified)

| column         | type         | notes                                 |
|----------------|--------------|---------------------------------------|
| doc_id         | STRING       | stable document ID                    |
| chunk_id       | STRING       | unique per chunk (index primary key)  |
| section_path   | STRING       | e.g., Manual > Safety > PPE           |
| chunk_text     | STRING       | cleaned text                          |
| source_uri     | STRING       | UC volume path                        |
| effective_from | TIMESTAMP    | for recency filtering                 |
| embedding      | ARRAY<FLOAT> | dimension matches the embedding model |

Creating and querying a Vector Search index

Python: create endpoint & index

# Databricks notebook
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

# 1) Create endpoint (choose "STANDARD" or "STORAGE_OPTIMIZED")
vsc.create_endpoint(
    name="vs-prod",
    endpoint_type="STORAGE_OPTIMIZED",  # plan for scale
)

# 2) Create an index synced from a Delta table with self-managed embeddings
vsc.create_delta_sync_index(
    endpoint_name="vs-prod",
    index_name="uc.catalog.rag_chunks_idx",
    source_table_name="uc.catalog.rag_chunks",  # has 'embedding' column + metadata
    primary_key="chunk_id",                     # must be unique per row
    embedding_vector_column="embedding",
    embedding_dimension=1024,                   # match your embedding model
    pipeline_type="TRIGGERED",                  # or "CONTINUOUS"
)

Docs: endpoint & index creation in Mosaic AI Vector Search.

SQL: query with vector_search()

-- Retrieve k=8 candidates via hybrid retrieval (vectors + keywords)
SELECT *
FROM vector_search(
  index => 'uc.catalog.rag_chunks_idx',
  query_text => 'What are the PPE requirements for lab B?',
  num_results => 8,
  query_type => 'HYBRID'  -- ANN (dense-only) or HYBRID
);

Docs: the vector_search() table function (Public Preview) in Databricks SQL.

Python: rerank + metadata filter

index = vsc.get_index(
    endpoint_name="vs-prod",
    index_name="uc.catalog.rag_chunks_idx",
)

results = index.similarity_search(
    query_text="What are the PPE requirements for lab B?",
    columns=["doc_id", "section_path", "chunk_text"],
    filters={"effective_from >=": "2025-01-01"},
    num_results=20,
    query_type="HYBRID",
    # Reranking (Public Preview) is enabled with one extra argument;
    # check the current client docs for the exact parameter name.
)

Reranking in Vector Search (Public Preview) improves retrieval quality with a single parameter.


Orchestrating the RAG pipeline on Databricks

Ingestion → Normalization → Chunking → Embedding → Index upsert → Evals & logging. Use UC Volumes for raw files, Delta Live Tables or Jobs for pipelines, and log all retrievals and answers for offline eval.

Model Serving hosts your LLM (open-source or custom) or connects to external providers through external model endpoints. Keep grounding strict: pass only top-k chunks, their citations, and metadata.


Retrieval strategies that work

  • Two-stage retrieval: HYBRID (k=50) → Rerank to top-10 → LLM context (k=5).
  • Time-aware recall: filter by effective_from or last N days; helpful for policy/data versioning.
  • Fielded filters: product line, region, version—kept as index metadata to avoid post-filtering drops.
  • Answer synthesis: instruct the LLM to quote sources with doc_id and section_path.
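
The two-stage pattern above is easy to express once you have stage-one candidates. A toy sketch — `cross_encoder_score` is a stand-in for a real cross-encoder/reranker model, not a Databricks API:

```python
# Toy two-stage retrieval: wide candidate pull, rerank, then trim for the LLM.

def cross_encoder_score(query, chunk):
    """Placeholder scorer: fraction of query terms present in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / max(len(q), 1)

def two_stage(query, candidates, rerank_top_k=10, context_k=5):
    """candidates: chunk strings from stage 1 (e.g. HYBRID with k=50)."""
    reranked = sorted(candidates,
                      key=lambda c: cross_encoder_score(query, c),
                      reverse=True)[:rerank_top_k]
    return reranked[:context_k]  # only the best few reach the LLM context

chunks = ["ppe rules for lab b", "cafeteria menu", "lab b requires goggles"]
print(two_stage("ppe requirements lab b", chunks, context_k=2))
```

The point of the funnel is cost control: the expensive scorer only sees the candidate pool, and the LLM only sees the survivors.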

Evaluation & observability (don’t skip)

  • Golden sets: curated Q/A with references; track exact-match and faithfulness.
  • Retrieval metrics: recall@k, MRR, rerank deltas; alert when recall slips.
  • Drift monitors: new document distributions, chunk length stats, embedding norms.
  • Human review loop: sample low-confidence answers; write back labels for training.
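
Recall@k and MRR from the bullets above are cheap to compute over a golden set. A minimal sketch (the toy golden-set data is illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc IDs that appear in the top-k retrieved."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    """queries: list of (retrieved_ids, relevant_ids) pairs.
    Averages the reciprocal rank of each query's first relevant hit."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

golden = [(["d2", "d1", "d9"], {"d1"}), (["d5", "d7"], {"d7", "d8"})]
print(recall_at_k(["d2", "d1", "d9"], {"d1"}, k=2))  # 1.0
print(mrr(golden))                                   # (1/2 + 1/2) / 2 = 0.5
```

Run these nightly over logged retrievals and alert on regressions — that is the "recall slips" signal the bullet refers to.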

Databricks’ RAG guidance and cookbook materials are good starting points for eval workflows.


Security & governance

  • Unity Catalog controls who can read raw files, tables, and vector indexes.
  • PII hygiene: redact before chunking; embargo sensitive classes from retrieval.
  • Data residency: align serving endpoints and storage to the right geo. Databricks notes residency for foundation model APIs and serving.

Common pitfalls (and fixes)

  • Oversized chunks → context bloat, hallucinations. Fix: 300–800 tokens, semantic splits.
  • Vector-only search → misses rare terms and IDs. Fix: default to HYBRID.
  • No rerank → on-topic but wrong top-1. Fix: enable reranking.
  • Index drift → stale answers. Fix: Delta-synced indexes plus CI tests to confirm coverage.
  • Opaque answers → trust issues. Fix: show citations with doc_id and section_path.

Reference implementation (end-to-end snippet)

# 1) Ingest → chunk → embed → write to Delta
# columns: doc_id, section_path, chunk_text, effective_from, embedding

# 2) Ensure index is synced to uc.catalog.rag_chunks_idx (see earlier)

# 3) Query + compose prompt
q = "Summarize PPE requirements for lab B and cite the source."

index = vsc.get_index(endpoint_name="vs-prod",
                      index_name="uc.catalog.rag_chunks_idx")

hits = index.similarity_search(
    query_text=q,
    columns=["doc_id", "section_path", "chunk_text"],
    num_results=32,
    query_type="HYBRID",
    filters={"section_path LIKE": "Safety"},  # token match on metadata
)

# Rows come back as arrays ordered like `columns`, with the score appended
rows = hits["result"]["data_array"]

context = "\n\n".join(
    f"[{r[0]} | {r[1]}] {r[2]}"
    for r in rows[:6]
)

prompt = f"""You are a compliance assistant.
Use only CONTEXT to answer and cite [doc_id | section_path].

QUESTION: {q}

CONTEXT:
{context}
"""

# 4) Call Model Serving endpoint with prompt (omitted: requests to endpoint)

Production checklist (copy/paste)

  • Delta-synced index with a unique chunk-level primary key
  • HYBRID retrieval on by default; reranking enabled where available
  • Chunks of 300–800 tokens with breadcrumbs and section metadata
  • effective_from / effective_to tracked for freshness filtering
  • Golden set in place; recall@k and MRR tracked with alerts
  • Unity Catalog permissions on files, tables, and indexes; PII redacted before chunking
  • Citations (doc_id | section_path) surfaced in every answer


Internal link ideas

  • Designing semantic chunking pipelines on Databricks
  • Choosing embedding models and dimensionality for enterprise RAG
  • Evaluating RAG with golden sets and automated drift checks
  • Governance patterns with Unity Catalog for vector indexes
  • Cost-aware scaling: standard vs storage-optimized endpoints

Summary & CTA

RAG fails when retrieval fails. On Databricks, you stabilize it with Delta-sourced, auto-synced indexes, HYBRID retrieval, and reranking, stitched together by well-governed pipelines and Model Serving. Start with the checklist above, ship a thin slice, and instrument everything. Your LLM will look smarter because your retrieval is. If you want a turn-key notebook and jobs YAML tailored to your stack, start from the reference implementation above.


Image prompt

“A clean, modern data architecture diagram illustrating a Databricks RAG pipeline: UC Volumes/Delta ingestion, chunking/embeddings, Mosaic AI Vector Search with hybrid search + reranking, Model Serving, and evaluation loop — minimalistic, high contrast, 3D isometric style.”

Tags

#RAG #Databricks #MosaicAI #VectorSearch #LLMOps #UnityCatalog #DataEngineering #HybridSearch #ModelServing
