JSONB Without Regret

JSONB Without Regret: When to Model vs Index Paths

Decision tree + costed examples for PostgreSQL engineers

Introduction — the “just JSON it” trap

You ship a feature fast by tossing semi-structured payloads into jsonb. Two quarters later, product wants sorting, filtering, and analytics on nested fields. Queries slow down, indexes balloon, and migrations feel radioactive. Sound familiar?

This guide gives you a ruthless decision tree and costed, reproducible examples so you can choose—confidently—between modeling columns vs indexing JSONB paths.


TL;DR Decision Tree

Q1. Is the field queried in every hot path (filters/joins/order-by) and stable?
Yes: Model as typed columns. Add B-Tree indexes. Optionally keep original JSONB for audit/raw.
No: go to Q2.

Q2. Is the access pattern selective (<10–20% rows) and on a few known paths?
Yes: Keep JSONB. Create GIN index on those paths (jsonb_path_ops or partial GIN).
No: go to Q3.

Q3. Do you need range sorts, prefix searches, or frequent ORDER BY on the path?
Yes: Model columns (B-Tree/Brin) or use functional indexes on extracted values.
No: go to Q4.

Q4. Will the schema evolve unpredictably across tenants/events?
Yes: JSONB-first. Add narrow partial GIN per tenant/type. Consider hybrid (promote only hot fields).
No: Model columns.

Q5. Is write amplification or index maintenance a bottleneck?
Yes: Prefer modeling (fewer/lighter indexes) and computed/generated columns.
No: JSONB with targeted GIN is fine.


Concepts You Must Weigh

  • Query selectivity: GIN shines when queries match a small fraction of rows. Low selectivity (e.g., 60% of table) wastes index lookups.
  • Cardinality & sorting: B-Tree excels at range scans and ORDER BY. GIN does not.
  • Write cost: GIN update cost grows with document size and path churn.
  • Index size & memory: GIN on wide JSONB can dwarf table size, pressuring cache and vacuum.
  • Evolution pressure: If fields change weekly, strict columns become a migration tax.
  • Observability: JSONB hides type drift; modeled columns force validation upfront.

Reference Architecture Patterns

1) Modeled Columns + Raw JSONB (Hybrid)

  • Table has typed columns for hot fields.
  • payload jsonb retains full event for audit.
  • Generated columns keep model in sync.

Why: Fast reads, small indexes, flexible raw storage.

2) JSONB-First with Targeted GIN

  • Everything lands in payload jsonb.
  • Create partial GIN indexes on few high-signal paths.
  • Use jsonb_path_ops for containment, or jsonb_ops for broader operators (bigger index).

Why: Fast iterate, pay-as-you-go indexing.

3) Extract-on-Write Materialization

  • Ingest into staging table with JSONB.
  • A trigger or job extracts hot fields into a fact table.
  • Analytics hit the fact table; product reads JSONB.

Why: Keeps OLTP and analytics healthy without over-indexing JSONB.


Costed, Reproducible Examples

All costs below are illustrative and meant to show orders of magnitude and trade-offs. Reproduce in your environment with EXPLAIN (ANALYZE, BUFFERS).

Setup

CREATE TABLE orders (
  id               bigserial primary key,
  created_at       timestamptz not null default now(),
  customer_id      bigint,
  -- hybrid: keep raw
  payload          jsonb not null
);

-- Sample JSONB:
-- {
--   "status": "shipped",
--   "total": 129.99,
--   "shipping": {"city":"Austin","state":"TX","zip":"78701"},
--   "items":[{"sku":"A1","qty":2},{"sku":"B2","qty":1}]
-- }

Workload

  • 10M orders.
  • Hot queries:
    1. Filter by status, time window.
    2. Top cities by shipped orders (group-by city).
    3. Look up orders containing SKU “B2”.
    4. Sort by total (range scans).

Option A — JSONB only, targeted indexes

Indexes

-- 1) status filter: narrow partial GIN on status 'shipped'
CREATE INDEX idx_orders_payload_status_shipped
  ON orders USING gin ((payload->>'status'))
  WHERE (payload->>'status') = 'shipped';

-- 2) city filter: functional B-Tree on extracted city (tiny index)
CREATE INDEX idx_orders_city
  ON orders ((payload->'shipping'->>'city'));

-- 3) items contains SKU: full GIN on items array (path ops)
CREATE INDEX idx_orders_items_gin
  ON orders USING gin ((payload->'items') jsonb_path_ops);

Representative costs

QueryExample SQLPlan ShapeLatency (cold→warm)Notes
Q1 status + timeWHERE payload->>'status'='shipped' AND created_at>=now()-'7d'::intervalBitmap Index Scan on partial + heap recheck120–900ms → 5–20msPartial + time predicate is selective; good.
Q2 top citiesSELECT payload->'shipping'->>'city', count(*) ... GROUP BY 1Index Only Scan on idx_orders_city + HashAgg300–1200ms → 30–60msFunctional B-Tree wins for grouping.
Q3 contains SKUWHERE payload->'items' @> '[{"sku":"B2"}]'Bitmap Index Scan on GIN (path_ops)400–1500ms → 20–50msGIN pays off when selective.
Q4 sort by totalORDER BY (payload->>'total')::numeric DESC LIMIT 50Seq Scan + Sort2–8s → 400–800msPain: GIN can’t help ORDER BY; consider modeling.

Storage impact (rules of thumb)

  • idx_orders_items_gin ≈ 0.3–0.8× table size depending on array width.
  • Functional B-Tree on city ≈ few hundred MB for 10M rows.
  • Partial index tiny (only ‘shipped’ rows).

When this is great

  • You mostly filter on few fields, sometimes query nested arrays, and rarely do large sorts on numeric paths.

Option B — Model hot fields + keep raw

ALTER TABLE orders
  ADD COLUMN status text,
  ADD COLUMN city text,
  ADD COLUMN total numeric;

-- Keep in sync (generated or trigger; generated shown)
ALTER TABLE orders
  ADD COLUMN payload_status text GENERATED ALWAYS AS (payload->>'status') STORED,
  ADD COLUMN payload_city   text GENERATED ALWAYS AS (payload->'shipping'->>'city') STORED,
  ADD COLUMN payload_total  numeric GENERATED ALWAYS AS ((payload->>'total')::numeric) STORED;

-- Make modeled aliases the ones you index/use:
CREATE INDEX ON orders (status);
CREATE INDEX ON orders (city);
CREATE INDEX ON orders (total DESC);
CREATE INDEX ON orders (created_at);

Representative costs

QueryExample SQLPlan ShapeLatency (cold→warm)Notes
Q1 status + timeWHERE status='shipped' AND created_at>=...Bitmap Index Scan on B-Tree(s)20–150ms → 2–8msSmaller/faster than GIN.
Q2 top citiesGROUP BY cityIndex Only Scan + HashAgg60–300ms → 8–20msB-Tree compression helps cache.
Q3 contains SKUpayload->'items' @> ...Seq Scan or add narrow GIN on items1–4s → 200–600msEither add one focused GIN, or accept slower path.
Q4 sort by totalORDER BY total DESC LIMIT 50Index Only Scan on (total)5–40ms → 1–5msB-Tree dominates sorting/range queries.

Storage impact (rules of thumb)

  • Three B-Tree indexes: small compared to one wide GIN.
  • JSONB column still present for audit/debug.

When this is great

  • You routinely filter/sort on a few hot fields and need predictable latency.

How to Measure in Your DB (copy/paste)

-- 1) Baseline
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE (payload->>'status') = 'shipped'
  AND created_at >= now() - interval '7 days';

-- 2) After index
CREATE INDEX ...;  -- as chosen above
VACUUM ANALYZE orders;

EXPLAIN (ANALYZE, BUFFERS) ...same query...

-- 3) Size checks
SELECT relname, pg_size_pretty(pg_relation_size(relid))
FROM pg_catalog.pg_statio_user_indexes
WHERE relname LIKE 'idx_orders%';

-- 4) Hot vs cold cache (repeat after pg_prewarm or after restart)

Interpretation tips

  • Favor plans that use Index Only Scans with few heap rechecks.
  • Check Buffers: to see if reads are cached.
  • Beware Rows Removed by Filter → low selectivity.
  • If sorts spill (Disk: ...), model columns or add better indexes.

Best Practices (No Regrets)

  1. Index only what you read
    Create partial GIN for the exact predicate (e.g., status=’shipped’). Don’t GIN the world.
  2. Prefer functional B-Tree for equality/group-by
    ON ((payload->'shipping'->>'city')) is tiny and fast.
  3. Model for ORDER BY / ranges
    Sorting JSONB-extracted numerics is expensive. Use columns + B-Tree.
  4. Use generated columns for safety
    They lock types at write time and reduce per-query casts.
  5. Cap index bloat
    REINDEX CONCURRENTLY, VACUUM (ANALYZE), and monitor pg_stat_all_indexes.idx_scan.
  6. Protect writes
    Each GIN write touches many pages. If ingest throughput matters, keep GIN count low and paths narrow.
  7. Promote fields deliberately
    Start JSONB-first; promote fields to columns once their access pattern stabilizes.

Common Pitfalls

  • Over-broad GIN on whole payload → huge index, poor cache locality.
  • Casting inline every query → defeats index usage (::numeric on the fly).
  • Low-selectivity filters on GIN → index is touched but doesn’t reduce work.
  • FORCE of schema too early → frequent migrations for fields that churn.
  • Ignoring ORDER BY needs → users complain about “slow lists”.

Comparison Table — Choose Quickly

NeedJSONB + GINFunctional B-Tree on JSONBModeled Column
Equality filter on few values
Range filter / ORDER BY⚠️ (if cast to numeric, still meh)
Frequent schema evolution⚠️ (migrations)
Index size❌ (large)✅ (small)✅ (small)
Write throughput
Analytics-friendly types⚠️⚠️

Real Example: Promote What Hurts

Symptoms

  • ORDER BY (payload->>'total')::numeric DESC LIMIT 50 takes 600ms warm.
  • WHERE payload->>'status'='shipped' is fine with a partial index.
  • Writes slowed after full-payload GIN was added.

Fix

  • Drop the full GIN.
  • Add total numeric GENERATED ALWAYS AS ... and index it.
  • Keep narrow partial GIN for item containment query only.

Outcome (typical)

  • The top-50 list drops to ~5ms warm.
  • Ingest TPS rises 10–30% (less GIN write churn).
  • Index size shrinks materially.

Conclusion & Takeaways

  • JSONB is not a license to skip modeling.
  • Use JSONB to absorb change; use columns to accelerate hot paths.
  • GIN is for selective lookups on few paths, not sorting and not “index everything.”
  • Generated columns give you the best of both worlds: flexible raw + typed speed.
  • Measure with EXPLAIN (ANALYZE, BUFFERS) before/after each index—let the plan, not vibes, decide.

Internal Link Ideas (for your site)

  • “GIN vs B-Tree for JSONB: operator classes and when to use jsonb_path_ops
  • “PostgreSQL generated columns for schema hygiene”
  • “From event streams to facts: materializing hot fields in OLTP”
  • “Cost-based tuning: reading BUFFERS output in EXPLAIN ANALYZE”
  • “Partial indexes playbook: keeping indexes tiny and useful”

Image Prompt (for DALL·E / Midjourney)

“A clean, modern data architecture decision tree showing a PostgreSQL table with a JSONB column branching into two paths: (1) modeled typed columns with small B-Tree indexes, (2) JSONB with narrow GIN indexes on specific paths; include icons for query types (filter, sort, group-by), annotate where each index shines; minimalistic, high-contrast, 3D isometric style.”


Tags

#PostgreSQL #JSONB #GIN #DatabaseDesign #DataEngineering #Indexing #Performance #SQL #Scalability #Architecture

Leave a Reply

Your email address will not be published. Required fields are marked *